CL | Oct 29, 2025

Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

arXiv:2510.25273v1 | 1 citation | h-index: 13
Originality: Incremental advance
AI Analysis

The paper addresses two obstacles to question answering in low-resource languages: scarce annotated datasets and limited domain knowledge in general-purpose models. The advance is incremental, since it builds on existing finetuning and data augmentation methods.

The paper tackles domain-specific question answering in a low-resource setting, the Hindi tourism domain, with a multi-stage finetuning strategy that uses synthetic data generated by large LLMs. Small models adapt effectively to this data, which makes the approach a scalable route to domain-specific QA.

Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
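The abstract describes augmenting a small original dataset with synthetic question-answer pairs and finetuning in several stages, but it does not specify the schedule. The following is a minimal sketch of how such staged training data might be assembled, under assumptions that are not taken from the paper: a two-stage order (synthetic pairs first, original pairs second), de-duplication by question text, and the hypothetical names `QAPair`, `dedupe`, and `build_stages`. The example pairs are illustrative and are not drawn from the paper's dataset.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: str  # "original" or "synthetic" (hypothetical labels)


def dedupe(pairs):
    """Drop pairs whose question repeats an earlier one (case-insensitive)."""
    seen = set()
    unique = []
    for pair in pairs:
        key = pair.question.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def build_stages(original, synthetic, seed=0):
    """Return an ordered list of (stage_name, examples).

    Assumed schedule, not taken from the paper: stage 1 trains on
    synthetic pairs only, stage 2 continues on the original pairs.
    """
    rng = random.Random(seed)
    original_questions = {p.question.strip().lower() for p in original}
    # Keep only synthetic pairs that do not duplicate an original question.
    stage1 = [
        p for p in dedupe(synthetic)
        if p.question.strip().lower() not in original_questions
    ]
    stage2 = dedupe(original)
    rng.shuffle(stage1)
    rng.shuffle(stage2)
    return [("synthetic", stage1), ("original", stage2)]


# Illustrative data only.
original = [QAPair("ताजमहल कहाँ स्थित है?", "आगरा", "original")]
synthetic = [
    QAPair("ताजमहल कहाँ स्थित है?", "आगरा", "synthetic"),
    QAPair("लाल किला किस शहर में है?", "दिल्ली", "synthetic"),
]

stages = build_stages(original, synthetic)
for name, examples in stages:
    print(name, len(examples))
```

Each stage's examples would then be passed to whatever finetuning loop is in use. Filtering synthetic pairs against the original questions is one possible design choice to avoid seeing the same question in both stages; the paper may handle overlap differently.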

