CL · Oct 20, 2025

Qomhrá: A Bilingual Irish-English Large Language Model

arXiv:2510.17652v1 · h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses the problem of limited language resources for Irish speakers and learners, representing an incremental advancement in bilingual LLMs for low-resource languages.

The paper tackles the challenge of developing a bilingual Irish-English large language model under low-resource constraints, resulting in Qomhrá, which shows gains of up to 29% in Irish and 44% in English across various benchmarks.

This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM) developed under low-resource constraints, and presents a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged on their Irish text generation by a native speaker, a learner, and other LLMs. Google's Gemini-2.5-Pro is ranked highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed, both generated with Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset whose accepted and rejected responses show near-perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification, and world knowledge, with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, which is crucial for chatbot functionality.
