CV · AI · Mar 6

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

arXiv:2603.06459v1

Predicted impact: top 57% in CV (last 90 days) · Originality: highly original

AI Analysis

For developers and researchers working with vision-language models, this paper shows that frozen features implicitly encode continuous geometric information that lightweight probes can extract. This enables multi-task geometric sensing without fine-tuning or text generation.

This paper investigates whether vision-language models encode continuous geometric information. A 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, versus 20.0 degrees MAE for the best text output. Five encoders trained under different paradigms reach statistically equivalent accuracy (R^2 approx. 0.55) even though their representational similarity is as low as CKA=0.41. Autoregressive generation damages geometric fidelity, but Qwen2.5-VL's LLM layers improve probe accuracy over its raw vision encoder.
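To make the probing setup concrete, here is a minimal sketch of fitting a linear probe on frozen features and scoring it by MAE. It assumes a closed-form ridge fit in NumPy and uses synthetic data. The feature dimension, the number of angle outputs, the regularization strength, and the function names are all hypothetical, and the paper's actual probe dimensions and training procedure are not given in this summary.

```python
import numpy as np


def fit_linear_probe(features, targets, l2=1e-3):
    """Fit weights and bias of a linear map from features to targets.

    Closed-form ridge regression on centered data; the bias is left
    unregularized. `l2` is an illustrative choice, not a value from the paper.
    """
    mean_x = features.mean(axis=0)
    mean_y = targets.mean(axis=0)
    xc = features - mean_x
    yc = targets - mean_y
    dim = xc.shape[1]
    weights = np.linalg.solve(xc.T @ xc + l2 * np.eye(dim), xc.T @ yc)
    bias = mean_y - mean_x @ weights
    return weights, bias


def mean_absolute_error(pred, target):
    """Average absolute error over all samples and outputs."""
    return float(np.mean(np.abs(pred - target)))


def demo(seed=0):
    """Run the probe on synthetic stand-ins for frozen features and angles."""
    rng = np.random.default_rng(seed)
    n, d, k = 500, 64, 15  # hypothetical sizes, not taken from the paper
    feats = rng.normal(size=(n, d))       # stand-in for frozen encoder features
    true_w = rng.normal(size=(d, k))
    angles = feats @ true_w + 0.1 * rng.normal(size=(n, k))  # synthetic targets

    train, test = slice(0, 400), slice(400, None)
    weights, bias = fit_linear_probe(feats[train], angles[train])
    pred = feats[test] @ weights + bias
    return weights.shape, bias.shape, mean_absolute_error(pred, angles[test])
```

The encoder never receives gradients in this setup: only the weight matrix and bias are fitted, so the probe's error reflects what is linearly decodable from the frozen features.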

Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees -- a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity -- functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
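The CKA figure quoted above measures how similar two encoders' feature spaces are on the same inputs. The sketch below assumes the linear variant of CKA; the abstract does not say which kernel the authors used, and the function name is hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear centered kernel alignment between two feature matrices.

    x has shape (n_samples, d1) and y has shape (n_samples, d2), with rows
    aligned to the same inputs. Returns a value in [0, 1]; 1 means the two
    representations match up to rotation and isotropic scaling.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))
```

Because the score is invariant to orthogonal transforms, two encoders can have low CKA and still support equally accurate linear probes, which is the pattern the abstract calls functional convergence without representational convergence.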
