AI CLJan 22

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen

IBM

arXiv:2601.15812v27.55 citationsh-index: 13

Originality Incremental advance

AI Analysis

This addresses the need for more nuanced evaluation to debug models and guide improvement in AI, though it is incremental as it builds on existing benchmarking.

The researchers tackled the problem of understanding why large language models fail, not just when, by introducing ErrorMap to extract failure signatures and ErrorAtlas to taxonomize errors, revealing underexplored patterns like omissions and misinterpretations across 35 datasets and 83 models.

Large Language Models (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model's unique "failure signature", clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation - one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available with plans to periodically update ErrorAtlas as new benchmarks and models emerge.

View on arXiv PDF

Similar