The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

arXiv:2605.09900

Predicted impact: top 68% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For researchers evaluating vision-language models, this benchmark exposes a critical failure in spatial reasoning that persists even with thinking-mode reasoning, highlighting a bottleneck for current VLMs.

KnotBench introduces an 858,318-image benchmark for diagrammatic knot reasoning, testing VLMs on 14 tasks across four families. No model achieves a strictly correct diagram-to-symbol transcription, and best scores on 8 of 14 tasks are under 1.5x random, revealing a fundamental perception-operation gap.

A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal grounding. An image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.
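The two headline counts in the abstract (cases at or below a random baseline, and tasks whose best score stays under 1.5x random) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `summarize`, the task and model labels, and every number in the toy data are hypothetical and chosen only to show the arithmetic. In the paper the grid is 14 tasks by 4 model configurations, which gives the 56 (task, model) cases.

```python
from typing import Dict, List, Tuple

def summarize(
    scores: Dict[str, Dict[str, float]],
    random_baseline: Dict[str, float],
    factor: float = 1.5,
) -> Tuple[int, List[Tuple[str, str]], List[str]]:
    """Count (task, model) cases at or below random, and tasks whose
    best score is under `factor` times the random baseline."""
    # Total number of (task, model) cases in the grid.
    total_cases = sum(len(per_model) for per_model in scores.values())

    # Cases where a model does no better than guessing.
    at_or_below = [
        (task, model)
        for task, per_model in scores.items()
        for model, acc in per_model.items()
        if acc <= random_baseline[task]
    ]

    # Tasks where even the best model stays under factor * random.
    weak_tasks = [
        task
        for task, per_model in scores.items()
        if max(per_model.values()) < factor * random_baseline[task]
    ]
    return total_cases, at_or_below, weak_tasks

# Hypothetical toy data (NOT results from the paper).
toy_scores = {
    "equivalence": {"model_a": 0.50, "model_b": 0.62},
    "identification": {"model_a": 0.30, "model_b": 0.55},
    "move_prediction": {"model_a": 0.20, "model_b": 0.24},
}
toy_baseline = {
    "equivalence": 0.50,      # e.g. a two-way choice
    "identification": 0.25,   # e.g. a four-way choice
    "move_prediction": 0.25,
}

total, at_or_below, weak_tasks = summarize(toy_scores, toy_baseline)
print(total, len(at_or_below), weak_tasks)
```

Expressing the threshold as a multiple of each task's own random baseline keeps tasks with different numbers of answer options comparable, which appears to be the motivation for reporting "1.5x random" rather than a fixed accuracy cutoff.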
