CL LG May 19

Language models struggle with compartmentalization

arXiv:2605.19284 52.7

AI Analysis

Identifies a limitation of language models in sharing what they learn across different presentations of the same concept, relevant to multilingual models and to models trained on multiple programming or formal languages.

LLMs fail to share statistical strength across different presentations of the same concept (e.g., English vs. Swahili, Python vs. Haskell), leading to redundant internal representations and decreased sample efficiency. This compartmentalization persists even with synthetic parallel data and shows a phase transition where intervention effectiveness depends on the number of presentations.

In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.
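The worst case described above can be illustrated with a toy lookup-table sketch. This is not the paper's experimental setup: the class names, the concept and presentation labels, and the round-robin training scheme are all hypothetical, chosen only to show why parallel per-presentation representations hurt sample efficiency. Note that the unified learner here is handed the alignment between presentations for free, which is exactly the step the abstract says a language model may fail to discover.

```python
# Toy illustration (assumption: lookup tables stand in for learned representations).

class CompartmentalizedLearner:
    """Stores a separate entry for every (presentation, concept) pair."""

    def __init__(self):
        self.table = {}

    def observe(self, presentation, concept, value):
        self.table[(presentation, concept)] = value

    def predict(self, presentation, concept):
        # Nothing learned in one presentation transfers to another.
        return self.table.get((presentation, concept))


class UnifiedLearner:
    """Stores one entry per concept, shared by all presentations."""

    def __init__(self):
        self.table = {}

    def observe(self, presentation, concept, value):
        # The presentation is ignored: the alignment is assumed to be known.
        self.table[concept] = value

    def predict(self, presentation, concept):
        return self.table.get(concept)


def coverage(learner, facts, presentations):
    """Fraction of (presentation, concept) queries answered correctly."""
    correct = sum(
        learner.predict(p, c) == v
        for p in presentations
        for c, v in facts.items()
    )
    return correct / (len(presentations) * len(facts))


# Hypothetical data: 12 concepts, each observed in exactly one of 3 presentations.
presentations = ["english", "swahili", "formal"]
facts = {f"concept_{i}": i * i for i in range(12)}

compartmentalized = CompartmentalizedLearner()
unified = UnifiedLearner()
for i, (concept, value) in enumerate(facts.items()):
    seen_in = presentations[i % len(presentations)]
    compartmentalized.observe(seen_in, concept, value)
    unified.observe(seen_in, concept, value)

compartmentalized_coverage = coverage(compartmentalized, facts, presentations)
unified_coverage = coverage(unified, facts, presentations)
```

With the same 12 training examples, the compartmentalized learner answers only the queries posed in the presentation where each concept was seen, so its coverage shrinks as the number of presentations grows, while the unified learner answers every query. To reach full coverage, the compartmentalized learner would need one example per concept per presentation.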
