Towards Geo-Culturally Grounded LLM Generations
This work addresses cultural bias in LLMs for global users, but it is incremental: it builds on existing retrieval-augmented methods without introducing a new paradigm.
The study addresses gaps in cultural awareness in large language models (LLMs) by comparing retrieval-augmented generation techniques. Search grounding significantly improved performance on multiple-choice benchmarks testing propositional cultural knowledge, but it also increased the risk of stereotypical judgments and failed to improve ratings of cultural familiarity in human evaluations.
Generative large language models (LLMs) have demonstrated gaps in their awareness of diverse cultures across the globe. We investigate the effect of retrieval-augmented generation and search-grounding techniques on LLMs' ability to display familiarity with various national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on multiple cultural awareness benchmarks. We find that search grounding significantly improves LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., cultural norms, artifacts, and institutions), while KB grounding's effectiveness is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models and fails to improve evaluators' judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional cultural knowledge and open-ended cultural fluency when evaluating LLMs' cultural awareness.
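The three-way comparison described above (no grounding, KB grounding, search grounding) can be illustrated with a minimal Python sketch of a multiple-choice evaluation loop. This is not the authors' implementation: the names `build_prompt` and `accuracy`, the item format, and the prompt layout are all assumptions made for illustration, and the model and retrievers are passed in as plain callables so that any LLM client or retrieval backend could be substituted.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interfaces (not from the paper): a model maps a prompt to a
# text answer, and a retriever maps a question to a list of text passages.
Model = Callable[[str], str]
Retriever = Callable[[str], List[str]]


def build_prompt(question: str, options: List[str], passages: List[str]) -> str:
    """Format a multiple-choice question, prepending retrieved passages if any."""
    lines: List[str] = []
    if passages:
        lines.append("Context:")
        lines.extend(f"- {p}" for p in passages)
    lines.append(f"Question: {question}")
    for i, option in enumerate(options):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(model: Model, items: List[Dict], retriever: Optional[Retriever] = None) -> float:
    """Fraction of items whose predicted letter matches the gold letter.

    With retriever=None this is the ungrounded baseline; passing a KB
    retriever or a web-search retriever gives the two grounded conditions.
    """
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for item in items:
        passages = retriever(item["question"]) if retriever else []
        prompt = build_prompt(item["question"], item["options"], passages)
        prediction = model(prompt).strip().upper()
        if prediction.startswith(item["answer"]):
            correct += 1
    return correct / len(items)
```

Under these assumptions, the three conditions differ only in the `retriever` argument: `accuracy(model, items)`, `accuracy(model, items, kb_retriever)`, and `accuracy(model, items, search_retriever)`, where `kb_retriever` and `search_retriever` are placeholders for whatever retrieval backends are available. A loop of this kind measures only propositional knowledge on multiple-choice items; it does not capture the stereotype risk or the human-judged cultural familiarity that the abstract reports separately.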