CL · Jun 3, 2025

On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures

arXiv:2506.02591v1 · 3 citations · h-index: 16 · Has Code · ACL
Originality: Incremental advance
AI Analysis

The paper addresses fairness and accessibility issues in AI systems for users from diverse cultural backgrounds, though its contribution is incremental: it highlights one specific bias.

The study finds that large language models (LLMs) default to the measurement systems most prevalent in their training data, and that their performance is unstable across systems. Reasoning methods such as chain-of-thought can mitigate this instability, but they increase test-time compute, a cost that falls disproportionately on users from underrepresented cultures.

Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, large language models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs' answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.
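To make the premise concrete, the abstract's point that conversions are well defined can be illustrated with a minimal Python sketch. It is not the paper's dataset or evaluation code. The function names, the choice of length and temperature as examples, and the 1% tolerance are all assumptions made for illustration. The helper `answers_agree` is a hypothetical check of whether two answers given in different systems state the same fact.

```python
import math

# Illustrative sketch only: names, units, and tolerance are assumptions,
# not taken from the paper.
KM_PER_MILE = 1.609344  # exact by definition of the international mile


def km_to_miles(km: float) -> float:
    """Convert kilometres to miles using the exact definitional factor."""
    return km / KM_PER_MILE


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def answers_agree(answer_km: float, answer_miles: float, rel_tol: float = 0.01) -> bool:
    """Hypothetical consistency check: do two answers, one stated in km and
    one in miles, describe the same quantity within a relative tolerance?"""
    return math.isclose(km_to_miles(answer_km), answer_miles, rel_tol=rel_tol)


if __name__ == "__main__":
    # A marathon is 42.195 km; 26.2 miles is the usual rounded figure.
    print(answers_agree(42.195, 26.2))
    # Repeating the metric number but labelling it miles is not the same fact.
    print(answers_agree(42.195, 42.195))
```

A check of this kind only tests whether two stated values are consistent under conversion. It says nothing about how many tokens a model needs to reach either answer, which is the cost the abstract attributes to reasoning methods.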

