FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation
This work addresses a critical step toward building world models and robotics foundation models. Its scope is incremental: it leaves existing diffusion-based video generators unchanged and contributes only inference-time modifications, so no retraining is required.
The paper tackles the problem of generating realistic robot videos from action trajectories by introducing two training-free inference techniques that actively incorporate action parameters into diffusion-based generation. The methods significantly improve action coherence and visual quality across diverse robot environments.
Generating realistic robot videos from explicit action trajectories is a critical step toward building effective world models and robotics foundation models. We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in diffusion-based robot video generation. Instead of treating action vectors as passive conditioning signals, our methods actively incorporate them to guide both the classifier-free guidance process and the initialization of Gaussian latents. First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity. Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics. Experiments on real robot manipulation datasets demonstrate that these techniques significantly improve action coherence and visual quality across diverse robot environments.
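The two techniques can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the linear dependence on the action's L2 norm, the resampling-based truncation, the choice to widen the truncation bound for larger actions, and the hypothetical names `action_scaled_guidance_scale`, `cfg_combine`, `action_scaled_truncated_noise`, `base_scale`, `gain`, and `min_bound`. Only the classifier-free guidance combination itself follows the standard formulation.

```python
import numpy as np


def action_scaled_guidance_scale(action, base_scale=1.0, gain=1.0):
    """Guidance scale that grows linearly with the action's L2 norm.

    The linear form is an assumption; the source only states that guidance
    strength is modulated in proportion to action magnitude.
    """
    magnitude = float(np.linalg.norm(action))
    return base_scale * (1.0 + gain * magnitude)


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def action_scaled_truncated_noise(shape, action, rng, min_bound=1.0, gain=1.0):
    """Sample initial Gaussian latents truncated to [-bound, bound].

    Assumption: the bound widens with action magnitude, so small actions
    start from lower-variance noise. Values outside the bound are resampled.
    """
    bound = min_bound + gain * float(np.linalg.norm(action))
    noise = rng.standard_normal(shape)
    mask = np.abs(noise) > bound
    while mask.any():
        # Redraw only the entries that fell outside the allowed range.
        noise[mask] = rng.standard_normal(int(mask.sum()))
        mask = np.abs(noise) > bound
    return noise, bound


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    action = np.array([0.3, -0.4, 0.0])  # hypothetical end-effector delta
    scale = action_scaled_guidance_scale(action, base_scale=2.0, gain=0.5)
    latents, bound = action_scaled_truncated_noise((4, 8, 8), action, rng)
    # Stand-ins for the denoiser's unconditional and conditional predictions.
    eps_u = rng.standard_normal(latents.shape)
    eps_c = rng.standard_normal(latents.shape)
    guided = cfg_combine(eps_u, eps_c, scale)
    print(scale, bound, guided.shape)
```

In a real pipeline, the two noise predictions would come from the video diffusion model with and without action conditioning, and the scale and bound would be computed per frame or per trajectory segment; the paper may use a different functional form for either.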