Point-to-Point: Sparse Motion Guidance for Controllable Video Editing
This work addresses motion preservation in video editing for users who need high-fidelity edits. It is an incremental improvement over existing methods, achieved by refining the motion representation.
The paper tackles the challenge of preserving motion while editing subjects in videos. It introduces anchor tokens, a novel motion representation that captures essential motion patterns using the prior of a video diffusion model, and reports more controllable and semantically aligned edits with improved edit and motion fidelity.
Accurately preserving motion while editing a subject remains a core challenge in video editing tasks. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representations. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose a novel motion representation, anchor tokens, which capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.
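To make the idea of relocating a sparse point trajectory concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: anchor tokens are derived from a video diffusion model, whereas this example only shows the simpler geometric notion of moving a 2D point track to a new starting position while keeping its frame-to-frame displacements. The function name `relocate_trajectory`, the `scale` parameter, and the sample coordinates are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]  # one (x, y) position per frame


def relocate_trajectory(
    traj: Trajectory, new_start: Point, scale: float = 1.0
) -> Trajectory:
    """Move a point trajectory so it begins at new_start.

    Displacements relative to the first frame are preserved
    (optionally scaled), so the motion pattern is kept while
    the location changes. Illustrative assumption only.
    """
    if not traj:
        return []
    x0, y0 = traj[0]
    nx, ny = new_start
    return [(nx + scale * (x - x0), ny + scale * (y - y0)) for x, y in traj]


# Hypothetical track: a point drifting right and slightly down over 3 frames.
source = [(10.0, 20.0), (12.0, 21.0), (14.0, 22.0)]

# Same motion, re-anchored on a new subject location.
moved = relocate_trajectory(source, new_start=(50.0, 5.0))

# Same motion at half magnitude, e.g. for a smaller edited subject.
moved_small = relocate_trajectory(source, new_start=(50.0, 5.0), scale=0.5)
```

The point of the sketch is that the motion lives in the displacements rather than in the absolute layout, which is one way to read the claim that a small set of trajectories can follow a new subject without being tied to the original one.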