Safer in Translation? Presupposition Robustness in Indic Languages
This work addresses a gap in multilingual LLM evaluation for healthcare advice, particularly for speakers of Indic languages, though the contribution is incremental because it builds on an existing English benchmark.
The authors address the lack of multilingual benchmarks for evaluating LLMs in healthcare by creating Cancer-Myth-Indic: 500 items from an existing English benchmark, translated into each of five Indic languages (2,500 items in total), to test how LLMs respond to false presuppositions about cancer.
More and more people are turning to large language models (LLMs) for healthcare advice and consultation, which makes it important to gauge the efficacy and accuracy of LLM responses to such queries. Existing medical benchmarks pursue this goal, but they are almost universally in English, leaving a notable gap in multilingual LLM evaluation. In this work, we help address this gap with Cancer-Myth-Indic, an Indic-language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages of the Indian subcontinent (500 items per language; 2,500 translated items in total). Each item embeds a false presupposition relating to cancer, and native-speaker translators followed a style guide for preserving these implicit presuppositions in translation. We evaluate several popular LLMs under this presupposition stress.
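The construction arithmetic above (a subset sampled evenly across categories, then one translated copy per language) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `category` field, the placeholder category names, the placeholder language labels, and the fixed seed are hypothetical and do not describe the authors' actual data format or sampling procedure.

```python
import random
from collections import Counter, defaultdict


def sample_evenly(items, total, seed=0):
    """Draw `total` items, spread as evenly as possible across categories.

    Assumes each item is a dict with a hypothetical "category" key.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)

    categories = sorted(by_category)
    # Every category gets `base` items; the first `extra` get one more.
    base, extra = divmod(total, len(categories))

    sample = []
    for i, cat in enumerate(categories):
        quota = base + (1 if i < extra else 0)
        if quota > len(by_category[cat]):
            raise ValueError(f"category {cat!r} has fewer than {quota} items")
        sample.extend(rng.sample(by_category[cat], quota))
    return sample


def expand_to_languages(sample, languages):
    """Create one (still untranslated) record per item per target language."""
    return [{**item, "language": lang} for lang in languages for item in sample]


# Toy pool with four placeholder categories; not the real Cancer-Myth data.
pool = [{"id": i, "category": f"category_{i % 4}"} for i in range(2000)]
# Placeholder labels; the five languages are not named in the text above.
languages = [f"language_{k}" for k in range(1, 6)]

subset = sample_evenly(pool, total=500)
records = expand_to_languages(subset, languages)
per_category = Counter(item["category"] for item in subset)
```

With this toy pool, `subset` holds 500 items, `per_category` is balanced across the placeholder categories, and `records` holds 500 entries per language. When the total does not divide evenly by the number of categories, the quotas differ by at most one item.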