CL | AI | CV | Nov 21, 2025

Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

arXiv:2511.17238v1 | 1 citation | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a practical problem: VLM robustness is overestimated. It exposes failure modes in multilingual and noisy table scenarios, though the advance is incremental because the contribution is a benchmark rather than a new method.

The paper tackles the gap between VLM benchmarks and real-world table reasoning by introducing MirageTVQA, a multilingual and visually noisy dataset. It reports a performance drop of over 35% for the best models under visual noise, along with a consistent English-first bias.

The impressive performance of vision-language models (VLMs) is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise, and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLMs for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.
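
The abstract reports "over 35% drop" without stating here whether that figure is an absolute difference in accuracy points or a drop relative to the clean-table score. The sketch below shows how the two readings differ. The function name `performance_drop` and the accuracy values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def performance_drop(clean_acc: float, noisy_acc: float) -> dict:
    """Compare accuracy on clean vs. noisy tables.

    Both inputs are accuracies in percent (0-100).
    Returns the absolute drop in percentage points and the
    relative drop as a percentage of the clean accuracy.
    """
    if clean_acc <= 0:
        raise ValueError("clean_acc must be positive")
    absolute = clean_acc - noisy_acc
    relative = 100.0 * absolute / clean_acc
    return {"absolute_points": absolute, "relative_percent": relative}


# Hypothetical accuracies, for illustration only (not from the paper).
drop = performance_drop(clean_acc=80.0, noisy_acc=50.0)
print(drop)  # {'absolute_points': 30.0, 'relative_percent': 37.5}
```

With these made-up numbers, the same pair of scores is a 30-point absolute drop but a 37.5% relative drop, so the reading matters when comparing the headline figure across models.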

Code Implementations: 1 repo
