CL · AI · Apr 3

Measuring Representation Robustness in Large Language Models for Geometry

arXiv:2604.16421 · 51.9 · h-index: 4 · Has Code
Predicted impact top 84% in CL · last 90 days · Originality: Incremental advance
AI Analysis

For researchers and practitioners evaluating LLMs on mathematical reasoning, this work reveals that current benchmarks overestimate robustness by ignoring representation sensitivity, and provides a framework to measure it.

LLMs exhibit accuracy drops of up to 14 percentage points solely due to representation choice in geometry problems, with vector formulations being a consistent failure point. A convert-then-solve intervention improves vector accuracy by up to 52 percentage points for high-capacity models but not for low-capacity ones.

Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at https://github.com/vedjaw/GeoRepEval.
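To make the metrics in the abstract concrete, here is a minimal Python sketch. It assumes that Invariance@3 is the fraction of problems answered correctly under all three representations, which is consistent with the stated properties (it splits each representation's accuracy into a robust and a fragile part, and it cannot exceed the weakest representation's accuracy) but is not confirmed by the text above. The toy `results` table, the column order, and every function name are hypothetical illustrations, not code from the GeoRepEval repository.

```python
from math import comb

# Hypothetical toy data: one row per problem, 1 = correct, 0 = incorrect.
# Assumed column order: (euclidean, coordinate, vector).
results = [
    (1, 1, 1),
    (1, 1, 0),
    (1, 0, 0),
    (0, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
    (1, 1, 0),
    (1, 1, 1),
]
REPS = ("euclidean", "coordinate", "vector")


def accuracy(rows, col):
    """Fraction of problems solved in a single representation."""
    return sum(r[col] for r in rows) / len(rows)


def invariance_at_3(rows):
    """Assumed definition: fraction of problems solved in all three forms."""
    return sum(1 for r in rows if all(r)) / len(rows)


def fragile(rows, col):
    """Fraction solved in this representation but not in all three."""
    return sum(1 for r in rows if r[col] and not all(r)) / len(rows)


def mcnemar_exact(rows, a, b):
    """Two-sided exact (binomial) McNemar p-value for representations a, b."""
    # Only discordant pairs carry information about a difference.
    a_only = sum(1 for r in rows if r[a] and not r[b])
    b_only = sum(1 for r in rows if r[b] and not r[a])
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    inv = invariance_at_3(results)
    print(f"Invariance@3 = {inv:.3f}")
    for col, name in enumerate(REPS):
        acc = accuracy(results, col)
        # Under the assumed definition: accuracy = robust part + fragile part.
        print(f"{name}: accuracy={acc:.3f}, fragile={fragile(results, col):.3f}")
    print(f"McNemar p (euclidean vs vector) = {mcnemar_exact(results, 0, 2):.3f}")
```

Under this reading, every problem counted by Invariance@3 is also counted in each representation's accuracy, so the metric can never exceed the lowest of the three accuracies. The paired test uses only problems where the two representations disagree, which is why it suits problem-level comparisons across parallel formulations.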

