Assessing Agentic Large Language Models in Multilingual National Bias
This work addresses the risk of cross-language disparities in AI recommendations for users in multilingual contexts; its contribution is incremental, building on existing bias studies.
The study tackles multilingual bias in large language models by testing their decision-making across languages in scenarios such as university applications and travel. It finds that local language bias is prevalent, and that models like GPT-4 reduce bias for English-speaking countries but fail to achieve robust multilingual alignment.
Large Language Models (LLMs) have garnered significant attention for their capabilities in multilingual natural language processing, yet studies of the risks associated with cross-language biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, lacking even descriptive analysis. This study is the first to address this gap. We test LLMs' applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks: GPT-4 and Sonnet reduce bias for English-speaking countries compared to GPT-3.5 but fail to achieve robust multilingual alignment. These results carry broader implications for multilingual AI agents and for applications such as education.\footnote{Code available at: https://github.com/yiyunya/assess_agentic_national_bias}
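
As an illustration of what quantifying local language bias in model-generated scores could look like, the following is a minimal sketch and not the study's actual metric. It assumes each score is keyed by a (query language, candidate country) pair and defines bias, hypothetically, as the score a country receives when queried in its own language minus the mean score it receives in all other languages. The function name `local_language_bias`, the data layout, and the toy numbers are all assumptions made for this example.

```python
from statistics import mean


def local_language_bias(scores, local_language):
    """Hypothetical bias measure (an assumption, not the paper's definition).

    scores: dict mapping (query_language, country) -> model-generated score
    local_language: dict mapping country -> its local language code
    Returns a dict mapping country -> local score minus mean non-local score.
    """
    bias = {}
    for country, lang in local_language.items():
        # Score given to the country when the prompt is in its own language.
        local = scores[(lang, country)]
        # Scores given to the same country when prompted in other languages.
        others = [
            s for (query_lang, c), s in scores.items()
            if c == country and query_lang != lang
        ]
        bias[country] = local - mean(others)
    return bias


# Toy, made-up scores for illustration only (not results from the study).
scores = {
    ("en", "UK"): 8.0, ("ja", "UK"): 7.0, ("zh", "UK"): 7.0,
    ("en", "Japan"): 6.0, ("ja", "Japan"): 9.0, ("zh", "Japan"): 6.0,
}
local_language = {"UK": "en", "Japan": "ja"}

bias = local_language_bias(scores, local_language)
```

Under this definition, a positive value means the model rates a country more favorably when asked in that country's language than when asked in other languages; a value near zero would indicate consistent scoring across languages.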