CL | AI | LG | Dec 16, 2025

Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models

arXiv:2512.14926v1 | h-index: 13 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses low-resource language support in multimodal AI, targeting Romanian. Its contribution is incremental: it applies existing methods to a new Romanian dataset.

The authors address the multimodal NLP resource gap for Romanian by translating the Flickr30k dataset into Romanian and extending it for visual question answering. They then fine-tune open-source vision-language models with LoRA. The resulting models show improved Romanian capabilities: the 7B-parameter Qwen2-VL-RoVQA gains up to +6.05% in BERTScore F1 over its original version, and the fine-tuned models make fewer grammatical errors.

Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
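
To make the "parameter-efficient" part of the abstract concrete, here is a minimal pure-Python sketch of the standard LoRA formulation, in which a frozen weight matrix W is adapted as W + (alpha / r) * B A and only the small matrices A and B are trained. The layer dimensions, rank, and alpha used here are illustrative assumptions; this summary does not state the authors' actual hyperparameters, and the helper names are hypothetical rather than taken from their code.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_in + d_out)   # only A (rank x d_in) and B (d_out x rank)
    return full, lora


def init_lora(d_out, d_in, rank, seed=0):
    """Standard LoRA init: A is small random Gaussian, B is all zeros."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, with W left unchanged."""
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    # Assumed sizes for illustration: a 4096 x 4096 projection with rank 8.
    full, lora = lora_param_counts(4096, 4096, 8)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

Because B starts at zero, the adapted model is identical to the original at the start of training, and the update stays low-rank throughout. That is what keeps the number of trainable parameters a small fraction of the full matrix.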

