OpenLLM-Ro -- Technical Report on Open-source Romanian LLMs
This work addresses the need for better AI tools for Romanian speakers, but its contribution is incremental: it applies existing methods to a new language.
The paper tackles the poor performance of existing multilingual LLMs in Romanian by training the first foundational and chat LLM specialized for the language, with the aim of improving Romanian-specific capabilities.
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English. Hence, their performance in English greatly exceeds their performance in other languages. This document presents our approach to training and evaluating the first foundational and chat LLM specialized for Romanian.