CL AI LG | May 9, 2025

Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge

Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas

arXiv:2505.05946v2 | 3 citations | h-index: 1 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the efficient adaptation of general-purpose LLMs to under-represented languages, a problem relevant to both researchers and practitioners. Its contribution is incremental: it applies an existing method (EWC) to a new language.

The authors improved the Lithuanian fluency of the Gemma2 LLM by continuing full-parameter pretraining on the first 10% of the Lithuanian portion of the CulturaX dataset, using Elastic Weight Consolidation (EWC) to prevent catastrophic forgetting of existing domain knowledge. Performance was preserved or improved on both fluency (perplexity) and domain-knowledge benchmarks (e.g., MMLU, GSM8K) in English and Lithuanian.
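
Since fluency is measured here through perplexity, a minimal sketch of that metric may help: perplexity is the exponential of the mean negative log-likelihood per token. The function name and the toy log-probabilities below are illustrative assumptions, not taken from the authors' codebase.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities the model assigned to each token.

    Hypothetical helper for illustration: exp of the mean negative log-likelihood.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that assigns probability 0.5 to every one of four tokens.
toy_log_probs = [math.log(0.5)] * 4
toy_ppl = perplexity(toy_log_probs)
```

Lower perplexity means the model finds the text less surprising, which is why it serves as a proxy for fluency in a given language.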

In this technical report, we empirically investigate the relationship between linguistic fluency and domain knowledge in the context of continual learning with large language models (LLMs). Specifically, we enhance the linguistic fluency of the Gemma2 LLM for the Lithuanian language by autoregressively pretraining its full parameter set on the first 10% of the Lithuanian language component of the CulturaX dataset. To prevent catastrophic forgetting of the model's existing domain knowledge, we apply Elastic Weight Consolidation (EWC), leveraging Fisher information estimated using data from the Massive Multitask Language Understanding (MMLU) benchmark. In the post-training evaluations, we assess linguistic fluency through perplexity and evaluate domain knowledge using accuracy on a suite of language understanding benchmarks, including ARC-Easy, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande, in both English and Lithuanian. The empirical results demonstrate that EWC not only mitigates catastrophic forgetting by preserving the model's performance in terms of both linguistic fluency and domain knowledge but also improves or maintains these capabilities for the newly added Lithuanian language. These findings highlight the potential for more efficient adaptation of general-purpose LLMs to under-represented languages without requiring access to the original training data. The accompanying codebase is openly accessible at https://github.com/Neurotechnology/LLM_EWC.
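
To make the EWC idea concrete, here is a minimal sketch of the commonly used diagonal form of the regularizer: the new-task loss is augmented with (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, where theta* are the original weights and F_i is a diagonal Fisher estimate. This is a toy illustration on plain Python lists, not the authors' implementation; the function names, the `lam` parameter, and the use of mean squared per-example gradients as the Fisher estimate are assumptions made for the example.

```python
from typing import List, Sequence


def diagonal_fisher(per_example_grads: Sequence[Sequence[float]]) -> List[float]:
    """Empirical diagonal Fisher: mean of squared per-example gradients.

    In the setting described above, the gradients would come from the
    reference data (MMLU); here they are toy numbers.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(theta: Sequence[float], theta_star: Sequence[float],
                fisher: Sequence[float], lam: float) -> float:
    """Quadratic penalty that anchors important weights near their old values."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )


def ewc_loss(task_loss: float, theta: Sequence[float], theta_star: Sequence[float],
             fisher: Sequence[float], lam: float) -> float:
    """Total objective: new-task loss plus the EWC penalty."""
    return task_loss + ewc_penalty(theta, theta_star, fisher, lam)


# Toy example with two parameters: only the first one matters to the old task.
toy_fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
toy_total = ewc_loss(0.25, theta=[1.0, 1.0], theta_star=[0.0, 0.0],
                     fisher=toy_fisher, lam=2.0)
```

In the toy example, moving the second parameter costs nothing because its Fisher value is zero, while moving the first is penalized. This is the mechanism by which EWC lets weights that matter little for existing knowledge adapt freely to the new language.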
