CL, Sep 26, 2024

Evaluating Multilingual Long-Context Models for Retrieval and Reasoning

arXiv:2409.18006v3, 28 citations, h-index: 4
AI Analysis

This work identifies challenges for multilingual long-context LLMs, particularly with longer contexts, multiple targets, and low-resource languages, which is important for researchers and developers working on cross-lingual AI applications.

The paper investigated how multilingual long-context LLMs perform on retrieval and reasoning tasks across five languages, revealing significant performance gaps: with one target sentence, accuracy fell from about 96% in English to about 36% in Somali; with three target sentences, it fell to 40% in English and 0% in Somali.

Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset, mLongRR, to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, achieve around 96% accuracy in English but only around 36% in Somali. With three target sentences, accuracy drops to 40% in English and 0% in Somali. Our findings highlight the challenges long-context LLMs face when processing longer contexts, an increase in the number of target sentences, or languages of lower resource levels.
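To make the evaluation setup concrete, the following is a minimal sketch of a multi-target retrieval test: several target sentences are hidden in filler text, and a model's answer is scored. It is not the mLongRR code. The function names, the random placement of targets, and the all-or-nothing substring scoring are assumptions made here for illustration only.

```python
import random


def build_context(haystack_sentences, target_sentences, seed=0):
    """Hide each target sentence at a random position in the filler text.

    Hypothetical helper: the placement strategy is an assumption,
    not the procedure used to build mLongRR.
    """
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    for target in target_sentences:
        position = rng.randint(0, len(sentences))
        sentences.insert(position, target)
    return " ".join(sentences)


def is_correct(model_answer, expected_values):
    """All-or-nothing scoring (assumed): every expected value must appear."""
    answer = model_answer.lower()
    return all(value.lower() in answer for value in expected_values)


def accuracy(outcomes):
    """Fraction of test cases scored as correct."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Illustrative data only; no real model is called here.
filler = [f"This is filler sentence number {i}." for i in range(50)]
targets = [
    "The secret code for the red door is 4821.",
    "The secret code for the blue door is 1375.",
    "The secret code for the green door is 9062.",
]
context = build_context(filler, targets, seed=42)
outcomes = [
    is_correct("The codes are 4821, 1375 and 9062.", ["4821", "1375", "9062"]),
    is_correct("The codes are 4821 and 1375.", ["4821", "1375", "9062"]),
]
print(accuracy(outcomes))
```

Under scoring of this kind, missing any one target makes the whole answer count as wrong, which is one plausible reason accuracy could fall as the number of targets grows. The scoring rule actually used in the paper may differ.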

Code Implementations: 1 repo