Reveals that parallel (i.e., translated) data is surprisingly unnecessary for learning aligned multilingual representations in LLMs.
April 1, 2026
Original Paper
On the limited utility of parallel data for learning shared multilingual representations
arXiv · 2603.29026
The Takeaway
Conventional wisdom holds that parallel data is the primary signal for cross-lingual alignment, but this study shows that aligned representations emerge during multilingual pretraining even with little or no parallel data. This shifts the focus for low-resource language modeling away from scarce translation pairs and toward better monolingual data.
From the abstract
Shared multilingual representations are essential for cross-lingual tasks and knowledge transfer across languages. This study looks at the impact of parallel data, i.e. translated sentences, in pretraining as a signal to trigger representations that are aligned across languages. We train reference models with different proportions of parallel data and show that parallel data seem to have only a minimal effect on the cross-lingual alignment. Based on multiple evaluation methods, we find that the […]
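To give a concrete sense of what "cross-lingual alignment" means as a measurable quantity, here is a minimal sketch of one common evaluation approach: embed translated sentence pairs and compute their mean cosine similarity. This is an illustration, not necessarily the paper's exact protocol; the checkpoint (xlm-roberta-base), the example sentence pairs, and the mean-pooling choice are all assumptions made for demonstration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any multilingual encoder would work here.
MODEL_NAME = "xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical English/German translation pairs; a real evaluation would
# use a standard parallel test set such as FLORES or Tatoeba.
pairs = [
    ("The cat sleeps on the sofa.",
     "Die Katze schläft auf dem Sofa."),
    ("I bought fresh bread this morning.",
     "Ich habe heute Morgen frisches Brot gekauft."),
]

# Alignment score: mean cosine similarity across translation pairs.
sims = [
    torch.nn.functional.cosine_similarity(
        sentence_embedding(src), sentence_embedding(tgt)
    ).item()
    for src, tgt in pairs
]
print(f"mean cross-lingual cosine similarity: {sum(sims) / len(sims):.3f}")
```

Higher similarity between translations, relative to a baseline of unrelated sentence pairs, indicates more strongly aligned representations; comparing this score across models pretrained with different proportions of parallel data is one way to test the paper's claim.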