CVOct 30, 2025

Masked Diffusion Captioning for Visual Feature Learning

arXiv:2510.26799v11 citationsh-index: 1EMNLP
Originality Incremental advance
AI Analysis

This addresses visual representation learning for computer vision tasks, but appears incremental as it builds on existing captioning and diffusion methods.

The paper tackles visual feature learning by captioning images with a masked diffusion language model, where text tokens are randomly masked and reconstructed. The resulting visual features are competitive with autoregressive and contrastive approaches in linear probing experiments across various models and datasets.

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes