CV · LG · May 13

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

arXiv:2605.14068 · 32.8
Predicted impact: top 84% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This benchmark provides a challenging test for exact topological reasoning in vision-language models, revealing a significant gap in current AI capabilities.

CurveBench is a benchmark for hierarchical topological reasoning from visual input, consisting of 756 images of nested Jordan curves. The strongest evaluated model, Gemini 3.1 Pro, reaches 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. RLVR-style fine-tuning of Qwen3-VL-8B raises accuracy on CurveBench-Easy from 2.8% to 33.3%, so the task remains far from solved.

We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
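
To make the containment-tree target concrete, here is a minimal Python sketch. It is not the authors' code. It assumes each curve is given as a simple polygon (a vertex list), that the root of the tree stands for the unbounded outer region, and that a prediction counts as correct when its tree matches the ground truth up to reordering of children. The abstract does not spell out the exact scoring rule, so that last point is an assumption. The names point_in_polygon, containment_parents, and canonical_form are hypothetical helpers introduced for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # a closed curve approximated by its vertices


def point_in_polygon(p: Point, poly: Polygon) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from p."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def containment_parents(curves: List[Polygon]) -> List[int]:
    """parents[i] is the innermost curve enclosing curve i, or -1 for the outer region.

    Assumption: curves are pairwise non-intersecting, so curve i lies inside
    curve j exactly when any single vertex of i lies inside j.
    """
    n = len(curves)
    enclosing = [
        [j for j in range(n) if j != i and point_in_polygon(curves[i][0], curves[j])]
        for i in range(n)
    ]
    # Enclosers of a curve form a nested chain; the innermost one is the deepest,
    # i.e. the one that itself has the most enclosers.
    return [
        max(enclosing[i], key=lambda j: len(enclosing[j]), default=-1)
        for i in range(n)
    ]


def canonical_form(parents: List[int]) -> str:
    """Parenthesis encoding of an unordered rooted tree (children sorted recursively)."""
    children = {node: [] for node in range(-1, len(parents))}
    for child, parent in enumerate(parents):
        children[parent].append(child)

    def encode(node: int) -> str:
        return "(" + "".join(sorted(encode(c) for c in children[node])) + ")"

    return encode(-1)  # -1 is the root: the unbounded outer region


def square(x: float, y: float, side: float) -> Polygon:
    return [(x, y), (x + side, y), (x + side, y + side), (x, y + side)]


# One outer square holding two disjoint squares, one of which holds a third.
curves = [square(0, 0, 10), square(1, 1, 3), square(5, 5, 4), square(6, 6, 2)]
truth = containment_parents(curves)   # [-1, 0, 0, 2]
prediction = [-1, 0, 1, 0]            # same shape, different curve labelling
print(canonical_form(truth) == canonical_form(prediction))  # True
```

Comparing canonical encodings rather than raw parent lists makes the check independent of how curves are numbered. A model that reports the right number of curves but the wrong nesting, for example three siblings inside the outer curve, would still be marked wrong.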

