CLIR · Feb 27, 2025

Few-Shot Multilingual Open-Domain QA from 5 Examples

arXiv:2502.19722v1
AI Analysis

This work addresses the high annotation cost for underrepresented languages in MLODQA, offering a general solution that reduces reliance on large-scale labeled data.

The authors tackle multilingual open-domain question answering (MLODQA) for underrepresented languages, where training data is limited, by introducing a few-shot learning approach that synthesizes training data with large language models (LLMs). Their model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and in cross-lingual and monolingual retrieval. The method also extends to zero-shot adaptation to new languages using only English-supervised data.

Recent approaches to multilingual open-domain question answering (MLODQA) have achieved promising results given abundant language-specific training data. However, the considerable annotation cost limits the application of these methods for underrepresented languages. We introduce a few-shot learning approach to synthesise large-scale multilingual data from large language models (LLMs). Our method begins with large-scale self-supervised pre-training using WikiData, followed by training on high-quality synthetic multilingual data generated by prompting LLMs with few-shot supervision. The final model, FsModQA, significantly outperforms existing few-shot and supervised baselines in MLODQA and cross-lingual and monolingual retrieval. We further show our method can be extended for effective zero-shot adaptation to new languages through a cross-lingual prompting strategy with only English-supervised data, making it a general and applicable solution for MLODQA tasks without costly large-scale annotation.
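As a rough illustration of what "prompting LLMs with few-shot supervision" could look like, the sketch below assembles a prompt from a handful of annotated examples and leaves the last question open for the model to complete. The `QAExample` class, the `build_few_shot_prompt` function, and the prompt wording are all hypothetical and chosen for illustration. They are not taken from the paper or its code, and no LLM is actually called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAExample:
    """One annotated (passage, question, answer) triple. Hypothetical structure."""
    passage: str
    question: str
    answer: str


def build_few_shot_prompt(examples: List[QAExample], passage: str, language: str) -> str:
    """Assemble a plain-text few-shot prompt.

    The template is an assumption made for this sketch, not the paper's prompt.
    """
    header = f"Write a question and a short answer in {language} about each passage."
    # Each annotated example becomes one fully filled-in demonstration block.
    blocks = [
        f"Passage: {ex.passage}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        for ex in examples
    ]
    # The final block is left open so the LLM completes the question and answer.
    blocks.append(f"Passage: {passage}\nQuestion:")
    return header + "\n\n" + "\n\n".join(blocks)
```

Under the same assumptions, passing English examples together with a different target `language` gives one loose reading of the cross-lingual prompting idea mentioned in the abstract. The paper's actual strategy may differ.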
