The Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis
This work addresses optimization inefficiencies in text-to-image synthesis and is an incremental improvement over existing methods.
The paper tackles inefficient optimization in generative fine-tuning. Its analysis of Flow Matching dynamics shows that standard training exerts no explicit control over correlations between heterogeneous features. The proposed Semantic Granularity Alignment (SGA) method accelerates convergence and improves structural integrity in text-to-image synthesis across DiT and U-Net architectures.
In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a quadratic form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model's effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA) and use text-to-image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.
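The quadratic-form view described above can be illustrated on a toy problem. The sketch below is not the SGA method and does not reproduce the paper's exact definition of the Data Interaction Matrix, which is not given in this text. It assumes a hypothetical linear velocity model `v(x) = W x`, a linear interpolation path, and the standard gradient-flow identity that the squared gradient norm of the batch MSE equals a sum over sample pairs of `r_i^T K(x_i, x_j) r_j`, where `K` is the empirical NTK and `r_i` is the per-sample residual. All names (`M`, `G`, `diag_part`, `off_part`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4  # toy batch size and data dimension (illustrative values)

# Flow Matching pairs: noise x0, data x1, linear interpolation at time t.
x0 = rng.normal(size=(N, d))
x1 = rng.normal(size=(N, d))
t = rng.uniform(size=(N, 1))
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0  # target velocity for the linear path

# Hypothetical stand-in model (an assumption): linear velocity field v(x) = W x.
W = 0.1 * rng.normal(size=(d, d))
r = xt @ W.T - u  # residuals r_i = v(x_t^i) - u_i, shape (N, d)

# Per-sample gradient of 0.5 * ||r_i||^2 with respect to W is outer(r_i, x_t^i).
G = np.einsum("ni,nj->nij", r, xt).reshape(N, -1)

# Interaction matrix: M[i, j] = <g_i, g_j> = r_i^T K(x_i, x_j) r_j.
M = G @ G.T

# For this linear model the NTK is K(x_i, x_j) = (x_i . x_j) * I,
# so the same matrix can be written in closed form.
M_closed_form = (r @ r.T) * (xt @ xt.T)

# Gradient of the batch loss L = (1 / 2N) * sum_i ||r_i||^2.
grad_L = G.mean(axis=0)

# Split the quadratic form into per-sample and cross-sample contributions.
diag_part = np.trace(M) / N**2            # independent sample learning
off_part = (M.sum() - np.trace(M)) / N**2  # cross-sample interaction

print("squared gradient norm:", grad_L @ grad_L)
print("diagonal contribution:", diag_part)
print("off-diagonal contribution:", off_part)
```

In this reading, a negative off-diagonal entry `M[i, j]` means the gradients of samples `i` and `j` point in opposing directions, which is one common way to make the notion of a gradient conflict concrete.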