AI & ML · Breaks Assumption

Finds that parallel (translated) data contributes surprisingly little to aligned multilingual representations in LLMs.

April 1, 2026

Original Paper

On the limited utility of parallel data for learning shared multilingual representations

Julius Leino, Jörg Tiedemann

arXiv · 2603.29026

The Takeaway

Conventional wisdom holds that parallel data is the primary signal for cross-lingual alignment, but this study finds that alignment emerges during ordinary multilingual pretraining, with little added benefit from translation pairs. That shifts the focus for low-resource language modeling away from scarce translation pairs and toward better monolingual data.
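The claim hinges on how "alignment" is measured. One common probe, and only a minimal sketch rather than the paper's protocol, is to embed held-out translation pairs with a multilingual encoder and compare mean-pooled representations across languages; the model name and the toy sentence pairs below are illustrative assumptions.

```python
# Hypothetical alignment probe: cosine similarity between mean-pooled
# sentence embeddings of translation pairs. "xlm-roberta-base" is a
# placeholder multilingual checkpoint, not the paper's reference model.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def embed(sentences):
    """Mean-pool the last hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

# Toy translation pairs (English / German); real probes use held-out bitext.
en = ["The cat sleeps on the sofa.", "Rain is expected tomorrow."]
de = ["Die Katze schläft auf dem Sofa.", "Morgen wird Regen erwartet."]

src, tgt = embed(en), embed(de)
cosine = torch.nn.functional.cosine_similarity(src, tgt, dim=-1)
print("pairwise EN-DE cosine similarity:", cosine.tolist())
```

If alignment scores like these stay roughly flat as the share of parallel data in pretraining changes, parallel data is doing little of the aligning work.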

From the abstract

Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the […]
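The experimental lever in the abstract is the proportion of parallel data in the pretraining mix. A minimal sketch of that kind of controlled mixing is below; the mix_corpus helper, its arguments, and the pair-concatenation format are assumptions for illustration, not the authors' recipe.

```python
# Sketch of building pretraining corpora that hold total size fixed while
# varying the fraction of parallel (translated) examples, as in the
# reference-model comparison the abstract describes. Illustrative only.
import random

def mix_corpus(monolingual, parallel_pairs, parallel_fraction, total_size, seed=0):
    """Sample a corpus with a given fraction of parallel examples.

    monolingual       -- list of single-language sentences
    parallel_pairs    -- list of (src_sentence, tgt_sentence) translations
    parallel_fraction -- e.g. 0.0, 0.01, 0.1 for different reference models
    """
    rng = random.Random(seed)
    n_parallel = int(total_size * parallel_fraction)
    n_mono = total_size - n_parallel
    # Concatenate each translation pair into one training example; other
    # recipes interleave the two sides or prepend language tags instead.
    parallel_examples = [f"{src} {tgt}" for src, tgt in rng.sample(parallel_pairs, n_parallel)]
    mono_examples = rng.sample(monolingual, n_mono)
    corpus = parallel_examples + mono_examples
    rng.shuffle(corpus)
    return corpus
```

Training otherwise identical models on corpora produced at several parallel_fraction values, then probing them with an alignment metric, is the kind of controlled comparison the abstract refers to.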