CV | CL | LG | Dec 3, 2025

Optical Context Compression Is Just (Bad) Autoencoding

arXiv:2512.03643v1 | 1 citation | h-index: 5 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work is an incremental critique for researchers in multimodal AI, highlighting that current evidence for optical context compression may be overstated.

The paper challenges the assumption that vision-based context compression improves language modeling by showing that simple alternatives like mean pooling match or exceed vision encoders in text reconstruction and outperform them in language modeling tasks.

DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. We compare the DeepSeek-OCR vision encoder against two simple alternatives: parameter-free mean pooling and a learned hierarchical encoder. At matched compression ratios, these simple approaches match or surpass vision for reconstruction. They also outperform it for language modeling, where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
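To make "matched compression ratios" concrete, here is a minimal sketch of the two parameter-free baselines the abstract names, mean pooling and truncation. It is not taken from the authors' repository. The function names `mean_pool` and `truncate`, the list-of-floats embedding representation, and the choice to keep the most recent tokens when truncating are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def mean_pool(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Average each consecutive window of `ratio` token embeddings into one vector.

    Hypothetical helper: n input vectors become ceil(n / ratio) output vectors.
    A final partial window is averaged over the vectors it actually contains.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    pooled: List[Vector] = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


def truncate(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Keep only the last ceil(n / ratio) embeddings and drop the rest.

    Assumption: truncation keeps the most recent tokens. The budget is chosen
    to equal the number of vectors mean_pool would emit for the same ratio.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    budget = -(-len(embeddings) // ratio)  # ceiling division
    return embeddings[-budget:] if budget else []


if __name__ == "__main__":
    # Four toy 2-dimensional "token embeddings", compressed 2x.
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(mean_pool(tokens, 2))  # [[2.0, 3.0], [6.0, 7.0]]
    print(truncate(tokens, 2))   # [[5.0, 6.0], [7.0, 8.0]]
```

Both functions return the same number of vectors for a given ratio, so a downstream model sees an equal context budget in either case. The difference is that pooling blends information from every token, while truncation discards the early tokens entirely.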

