CL Aug 29, 2025

The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

Seiji Maekawa, Hayate Iso, Nikita Bhutani

NVIDIA

arXiv:2509.00245v2 2.7 h-index: 11

Originality: Incremental advance

AI Analysis

The paper addresses a gap in LLM evaluation: statistical reasoning in real-world scenarios such as candidate selection. The advance is incremental, since it builds on existing benchmarking efforts.

The paper evaluates whether LLMs can identify globally distinctive features across document sets. It introduces the Distinctive Feature Mining (DFM) task and the DiFBench framework, and finds that models degrade significantly as complexity increases and often misidentify frequent features as distinctive.

Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained, statistical reasoning and rarity detection.
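The rarity criterion in the abstract (a feature appearing in less than 10% of documents) can be illustrated with a simple document-frequency count. The following is a minimal sketch of that counting step only, not the DiFBench implementation: the function name `distinctive_features`, the input layout (a dict mapping document ids to sets of already-extracted feature strings), and the example data are all assumptions made for illustration.

```python
from collections import Counter


def distinctive_features(docs, threshold=0.10):
    """Map each document id to its features that are rare across the set.

    docs: dict of doc_id -> set of feature strings (hypothetical layout;
          feature extraction is assumed to have happened already).
    threshold: a feature is treated as distinctive when it appears in
               less than this fraction of the documents.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each feature.
    doc_freq = Counter(f for feats in docs.values() for f in set(feats))
    return {
        doc_id: sorted(f for f in set(feats) if doc_freq[f] / n_docs < threshold)
        for doc_id, feats in docs.items()
    }


# Illustrative data: 20 made-up candidate profiles that all list "python".
docs = {f"cv{i}": {"python"} for i in range(20)}
docs["cv0"].add("rust")  # 1 of 20 documents (5%): below the threshold
docs["cv1"].add("go")    # 2 of 20 documents (10%): not below the threshold
docs["cv2"].add("go")

result = distinctive_features(docs)
print(result["cv0"])  # ['rust']
print(result["cv1"])  # []
```

With explicit feature sets the count is trivial. The task described in the abstract is harder because a model has to perform the equivalent global comparison directly over raw documents, which is where frequent features end up misidentified as distinctive.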
