CL · Aug 7, 2025

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev

arXiv:2508.05305v1 · 2 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses text generation in natural language processing and offers an incremental improvement over the existing Large Concept Model (LCM).

The paper tackles text generation through sentence-level embedding prediction. It proposes SONAR-LLM, a decoder-only transformer whose hybrid objective combines semantic abstraction with token-level supervision, and reports competitive generation quality across model sizes from 39M to 1.3B parameters.

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
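The hybrid objective described in the abstract can be illustrated with a deliberately tiny, pure-Python sketch. It is not the authors' implementation. Every name here (W_DEC, token_cross_entropy, train_step, demo) is hypothetical, the "decoder" is a fixed linear map standing in for the frozen SONAR decoder, and a single trainable matrix A stands in for the decoder-only transformer. The sketch only shows the core idea: the model outputs a continuous sentence embedding, the loss is token-level cross-entropy computed through a frozen decoder, and gradients update the predictor alone.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# ASSUMPTION: a fixed linear map (3 tokens x 2 embedding dims) stands in for
# the frozen SONAR decoder. It is never updated during training.
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

def token_cross_entropy(e_hat, target_tokens):
    """Summed token-level cross-entropy under the frozen toy decoder,
    plus its gradient with respect to the predicted embedding e_hat."""
    probs = softmax(matvec(W_DEC, e_hat))
    loss = -sum(math.log(probs[t]) for t in target_tokens)
    # dL/dlogits = sum over target tokens of (probs - onehot(token))
    grad_logits = [len(target_tokens) * p for p in probs]
    for t in target_tokens:
        grad_logits[t] -= 1.0
    # Backpropagate through the frozen decoder: dL/de = W_DEC^T @ dL/dlogits
    grad_e = [
        sum(W_DEC[k][j] * grad_logits[k] for k in range(len(W_DEC)))
        for j in range(len(e_hat))
    ]
    return loss, grad_e

def train_step(A, h, target_tokens, lr=0.1):
    """One gradient step on the predictor A only (e_hat = A @ h)."""
    e_hat = matvec(A, h)  # predicted sentence embedding
    loss, grad_e = token_cross_entropy(e_hat, target_tokens)
    for i in range(len(A)):
        for j in range(len(h)):
            A[i][j] -= lr * grad_e[i] * h[j]  # dL/dA = outer(grad_e, h)
    return loss

def demo(steps=50):
    A = [[0.0, 0.0], [0.0, 0.0]]  # trainable predictor (toy stand-in for the transformer)
    h = [1.0, 0.5]                # hypothetical context representation
    target = [0, 0, 1]            # token ids of the "next sentence"
    return [train_step(A, h, target) for _ in range(steps)]

if __name__ == "__main__":
    losses = demo()
    print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

In this toy setup the loss falls over the training steps while W_DEC stays untouched, which mirrors the likelihood-based training signal the abstract describes. The real SONAR decoder is an autoregressive sequence model, so this bag-of-tokens stand-in should be read only as a picture of the gradient path.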
