Small Languages, Big Models: A Study of Continual Training on Languages of Norway
This work addresses data scarcity for less widely spoken languages, enabling better AI tools for speakers of Norwegian Bokmål, Nynorsk, and Northern Sámi. The contribution is incremental, since it builds on existing continual training methods.
The paper tackles the challenge of training large language models for less widely spoken languages such as Norwegian and truly low-resource languages such as Northern Sámi. It introduces a three-stage continual training approach and uses it to train NorMistral-11B, an 11.4-billion-parameter model. The approach improves both downstream performance and inference efficiency for the target languages.
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves both downstream performance and inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.