The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis
This work addresses the under-explored problem of multilingual in-context learning and suggests that the importance of demonstrations may be overestimated. The contribution is incremental in that it extends existing English-focused studies to the multilingual setting.
The study analyzes multilingual in-context learning across 5 models, 9 datasets, and 56 languages. It finds that the effectiveness of demonstrations varies significantly across models, tasks, and languages, that strong instruction-following models such as Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to demonstration quality, and that a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages.
In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.
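To make the distinction between demonstrations and templates concrete, the following is a minimal sketch of how an in-context learning prompt might be assembled. It is not the authors' code: the function name `build_prompt`, the sentiment template, and the example sentences are all hypothetical and chosen only for illustration. The few-shot prompt prepends labeled demonstrations, while the zero-shot prompt relies on the template alone.

```python
from typing import List, Tuple


def build_prompt(template: str,
                 demonstrations: List[Tuple[str, str]],
                 query: str) -> str:
    """Assemble an in-context learning prompt (hypothetical helper).

    Each demonstration is an (input, label) pair rendered with the template.
    The query is rendered with the same template and an empty label slot,
    which the model is expected to complete. No parameters are updated.
    """
    blocks = [template.format(input=x, label=y) for x, y in demonstrations]
    # Leave the label slot empty for the query and drop the trailing space.
    blocks.append(template.format(input=query, label="").rstrip())
    return "\n\n".join(blocks)


# Hypothetical template and multilingual demonstrations (illustrative only).
TEMPLATE = "Review: {input}\nSentiment: {label}"
demos = [
    ("Das Essen war ausgezeichnet.", "positive"),
    ("El servicio fue muy lento.", "negative"),
]

few_shot = build_prompt(TEMPLATE, demos, "Le film est ennuyeux.")
zero_shot = build_prompt(TEMPLATE, [], "Le film est ennuyeux.")

print(few_shot)
print("---")
print(zero_shot)
```

Comparing model accuracy on `few_shot` against `zero_shot` prompts, per model, task, and language, is one simple way to probe how much the demonstrations contribute beyond the template itself.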