Pretraining Finnish ModernBERTs
This work addresses the need for efficient language models tailored to Finnish and its linguistic context. The contribution is incremental, since it builds on the existing ModernBERT architecture.
The paper pretrains ModernBERT encoder models for Finnish and other languages relevant to Finland. The resulting models are competitive with, or superior to, existing multilingual models, and they outperform monolingual models on tasks that require contexts longer than 512 tokens.
This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland. Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens. We present empirical results on using different data in the final stage of training. The code and models are publicly released.