CL AI May 24

Mimir: Large-scale Multilingual Concept Modeling

Elio Musacchio, Lucia Siciliani, Pierpaolo Basile

arXiv:2605.25263 96.1

AI Analysis

This work shifts language modeling from token-level to concept-level prediction, with the aim of improving how models understand and generate meaning in multilingual NLP.

Mimir is a 1.6B-parameter Large Concept Model trained for next-concept prediction rather than next-token prediction. It is pre-trained on a multilingual corpus of 38.9B sentences across 46 languages and instruction-tuned on 66.8M sentences across 35 languages, and it achieves competitive performance against a language model of comparable size.

Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing computations on these tokens, such as predicting the next token given the preceding ones as context. This paradigm has become the standard in modern language modeling, especially given the outstanding performance obtained by token-based architectures. However, recent works have not only begun to question how language models process and understand meaning from tokens, but also to question whether using higher levels of granularity could advance the research field. This led to the idea of Concept Modeling, that is, to directly train models for next-concept prediction rather than next-token prediction. The goal is to change the input from tokens to concepts, forcing the underlying language model to shift its granularity from fine-grained tokens to broad concepts. In this work, we introduce Mimir, a 1.6B Large Concept Model trained for multilingual concept understanding and generation. We leverage a large-scale multilingual pre-training corpus (38,883,987,240 sentences) spanning 46 languages and a large-scale multi-turn and multilingual instruction-tuning dataset (66,816,428 sentences) covering a total of 35 languages. We extensively evaluate model performance against a language model with a comparable number of parameters.
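To make the difference between the two training objectives concrete, here is a minimal sketch of how the (context, target) pairs change when the prediction unit moves from tokens to concepts. It assumes that one concept corresponds to one sentence, which the sentence-based corpus sizes suggest but the abstract does not state outright. The whitespace tokenizer, the regex sentence splitter, and the function names `token_pairs` and `concept_pairs` are illustrative stand-ins, not part of Mimir. A real concept model would presumably operate on sentence embeddings rather than raw strings.

```python
import re


def token_pairs(text):
    """Build (context, target) pairs for next-token prediction.

    Uses whitespace tokens as a stand-in for a real subword tokenizer.
    """
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def concept_pairs(text):
    """Build (context, target) pairs for next-concept prediction.

    Assumption: one concept = one sentence, split naively on ., ! or ?.
    """
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    return [(sentences[:i], sentences[i]) for i in range(1, len(sentences))]


doc = "Mimir is a concept model. It reads sentences. It predicts the next one."

# Same text, two granularities: many fine-grained targets vs. few broad ones.
print(len(token_pairs(doc)), "token-level training pairs")
print(len(concept_pairs(doc)), "concept-level training pairs")
for context, target in concept_pairs(doc):
    print(context, "->", target)
```

The same document yields far fewer prediction steps at the concept level, and each target carries a whole sentence of meaning instead of a word fragment, which is the granularity shift the abstract describes.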
