CV · Mar 6

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

arXiv:2603.06507v1 · 5 citations
Predicted impact: top 1% in CV · last 90 days
Originality: Highly original
AI Analysis

This work addresses the need for scalable, efficient multi-modal synthesis without reliance on external models, though its core idea is incremental in that it combines representation learning with flow matching.

The paper improves generative models by integrating semantic representation learning directly into training. It reports superior image, video, and audio generation while following expected scaling laws.

Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
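The abstract does not specify how Dual-Timestep Scheduling is implemented, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes two timesteps drawn per sample, a random per-token assignment to one of them, and the common linear flow matching path x_t = (1 - t) * x0 + t * eps with velocity target eps - x0. The function name `dual_timestep_corrupt` and the `high_noise_ratio` parameter are hypothetical and do not come from the paper.

```python
import random

def dual_timestep_corrupt(tokens, high_noise_ratio=0.5, rng=None):
    """Corrupt token vectors with two noise levels (illustrative sketch only).

    tokens: list of clean token vectors x0 (lists of floats).
    Returns (noisy_tokens, velocity_targets, timesteps), one entry per token.
    """
    if rng is None:
        rng = random.Random()
    # Assumption: two timesteps in [0, 1]; t=0 is clean data, t=1 is pure noise.
    t_a, t_b = rng.random(), rng.random()
    t_low, t_high = min(t_a, t_b), max(t_a, t_b)

    noisy, targets, timesteps = [], [], []
    for x0 in tokens:
        # Assumption: each token independently receives the high or low noise level,
        # so some tokens carry more information than others.
        t = t_high if rng.random() < high_noise_ratio else t_low
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Linear interpolation path between data and noise.
        x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
        # Velocity along that path: d x_t / d t = eps - x0.
        v = [e - a for a, e in zip(x0, eps)]
        noisy.append(x_t)
        targets.append(v)
        timesteps.append(t)
    return noisy, targets, timesteps

# Usage: eight toy tokens of dimension 2, seeded for reproducibility.
tokens = [[float(i), float(-i)] for i in range(8)]
noisy, targets, timesteps = dual_timestep_corrupt(tokens, rng=random.Random(0))
```

In a training loop, a model would receive `noisy` together with the per-token `timesteps` and regress onto `targets`. The less corrupted tokens then act as context for inferring the more corrupted ones, which is the information asymmetry the abstract describes.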

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
