New Encoders for German Trained from Scratch: Comparing ModernGBERT with Converted LLM2Vec Models
The comparison gives German NLP practitioners actionable guidance on choosing an encoder based on parameter efficiency and compute constraints.
This work addresses the problem of building high-quality German encoders by comparing training from scratch (ModernGBERT) with decoder-to-encoder conversion (LLäMmleinVec) under identical data and training constraints. ModernGBERT 1B sets a new state of the art on SuperGLEBer (avg 0.808) and reaches competitive performance on German MTEB (0.551).
Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction. Both model families undergo a context extension to 8,192 tokens. On SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg 0.808), surpassing GBERT Large (+4%) and the seven times larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. Overall, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate; when a pre-trained decoder exists and compute is limited, conversion offers an effective alternative. ModernGBERT and LLäMmleinVec, including all code, data, and intermediary checkpoints, are published under a research-only RAIL license.
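The masked next-token prediction objective named in the abstract can be illustrated with a minimal sketch. It assumes the formulation commonly described for LLM2Vec, in which a masked token at position i is predicted from the model output at position i-1. The function name `build_mntp_example`, the `mask_id` value, and the `-100` ignore index are illustrative assumptions, not details taken from the LLäMmleinVec training code.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # assumed convention: positions with this label add no loss


def build_mntp_example(
    token_ids: Sequence[int],
    mask_positions: Sequence[int],
    mask_id: int,
) -> Tuple[List[int], List[int]]:
    """Build (inputs, labels) for masked next-token prediction.

    Each masked position i is replaced by `mask_id` in the inputs, and the
    original token is placed as the label at position i - 1, so the loss is
    computed on the output of the preceding position.
    """
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for pos in mask_positions:
        if not 0 < pos < len(token_ids):
            # Position 0 has no preceding position to predict from.
            raise ValueError(f"cannot mask position {pos}")
        inputs[pos] = mask_id
        labels[pos - 1] = token_ids[pos]
    return inputs, labels


# Hypothetical token ids; 0 stands in for the mask token id.
inputs, labels = build_mntp_example([10, 11, 12, 13, 14], [2, 4], mask_id=0)
print(inputs)
print(labels)
```

In this sketch the mask positions are passed in explicitly to keep the example deterministic; a training pipeline would typically sample them at random with some masking rate.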