LG | Oct 2, 2025

Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions

arXiv:2510.02081v1 | h-index: 24
Originality: Incremental advance
AI Analysis

This work solves a specific bottleneck in Flow Matching algorithms for applications requiring high precision, such as robotic manipulation, though it appears incremental as it builds directly on existing FM foundations.

The paper addresses the train-inference gap in Flow Matching (FM) algorithms: because training is simulation-free, the model's outputs cannot be assessed during training. The authors also argue that FM's emphasis on straight predefined paths can introduce stiffness, a problem in precision-demanding applications such as robotic manipulation. They propose fine-tuning FM via Maximum Likelihood Estimation of reconstructions and report improved inference performance in image generation and robotic manipulation tasks.

The Flow Matching (FM) algorithm achieves remarkable results in generative tasks, especially in robotic manipulation. Building upon the foundations of diffusion models, the simulation-free paradigm of FM enables simple and efficient training, but inherently introduces a train-inference gap: the model's output cannot be assessed during the training phase. In contrast, other generative models, including Variational Autoencoders (VAEs), Normalizing Flows and Generative Adversarial Networks (GANs), directly optimize a reconstruction loss. This gap is particularly evident in scenarios that demand high precision, such as robotic manipulation. Moreover, we show that FM's over-pursuit of straight predefined paths may introduce serious problems, such as stiffness, into the system. These observations motivate us to fine-tune FM via Maximum Likelihood Estimation of reconstructions, an approach made feasible by FM's underlying smooth ODE formulation, in contrast to the stochastic differential equations (SDEs) used in diffusion models. This paper first theoretically analyzes the relation between training loss and inference error in FM. We then propose a method of fine-tuning FM via Maximum Likelihood Estimation of reconstructions, which includes both straightforward fine-tuning and residual-based fine-tuning approaches. Furthermore, through specifically designed architectures, the residual-based fine-tuning can incorporate the contraction property into the model, which is crucial for the model's robustness and interpretability. Experimental results in image generation and robotic manipulation verify that our method reliably improves the inference performance of FM.
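
The train-inference gap described in the abstract can be illustrated with a toy sketch. This is not the paper's method: the abstract does not give the objective, so the linear velocity field, the paired samples `(x0, x1)`, the fixed-step Euler solver, and the Gaussian likelihood with a fixed `sigma` are all assumptions made for illustration, and every function name is hypothetical. The sketch only contrasts a simulation-free velocity-regression loss, which never solves the ODE, with a loss evaluated on the endpoint that the ODE actually produces at inference time.

```python
import numpy as np

def velocity(x, t, W, b):
    """Hypothetical toy velocity field v(x, t) = x @ W + b (t is ignored here)."""
    return x @ W + b

def fm_training_loss(x0, x1, W, b, rng):
    """Simulation-free FM loss: regress the velocity on straight paths, no ODE solve."""
    t = rng.uniform(size=(x0.shape[0], 1))   # one random time per sample
    xt = (1.0 - t) * x0 + t * x1             # point on the straight path x0 -> x1
    target = x1 - x0                         # constant velocity of that path
    diff = velocity(xt, t, W, b) - target
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def integrate(x, W, b, steps=100):
    """Fixed-step Euler solve of dx/dt = v(x, t) from t = 0 to t = 1 (inference)."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * velocity(x, t, W, b)
        t += dt
    return x

def reconstruction_nll(x0, x1, W, b, sigma=1.0):
    """Gaussian negative log-likelihood (up to a constant) of x1 given the ODE endpoint.

    Assumption: samples are paired, so the endpoint reached from x0 is
    compared against its own target x1.
    """
    x1_hat = integrate(x0, W, b)
    sq_err = np.sum((x1_hat - x1) ** 2, axis=1)
    return float(np.mean(sq_err) / (2.0 * sigma ** 2))
```

With paired data generated as `x1 = x0 + c` for a constant shift `c`, setting `W` to zeros and `b = c` drives both quantities to (numerically) zero, while perturbing `b` raises both. For a general nonlinear field the two losses need not move together, which is the gap the abstract refers to.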

