CL · Oct 20, 2025

Qomhrá: A Bilingual Irish-English Large Language Model

arXiv:2510.17652v1 · h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses the problem of limited language resources for Irish speakers and learners, representing an incremental advancement in bilingual LLMs for low-resource languages.

The paper tackles the challenge of developing a bilingual Irish-English large language model under low-resource constraints, resulting in Qomhrá, which shows gains of up to 29% in Irish and 44% in English across various benchmarks.

This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM) developed under low-resource constraints, presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner, and other LLMs. Google's Gemini-2.5-Pro is ranked highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset whose accepted and rejected responses show near-perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification, and world knowledge, with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.

