CV · Jun 2

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

arXiv:2606.03868 · 31.7 · h-index: 11
Predicted impact: top 82% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the challenge of aligning video priors with robot actions for dexterous manipulation, offering a unified framework for both policy learning and data generation.

The paper proposes Donk, a unified video-action denoising model that jointly models interaction videos and hand trajectories. Conditioned on language, an initial image, and the initial hand state, it acts as a dexterous action policy; with the image condition removed, it generates paired video-action data from text alone. The authors report improved dexterous trajectory accuracy while preserving strong video fidelity.

Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.
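The two conditioning regimes described in the abstract can be illustrated with a small sampling sketch. This is a toy under stated assumptions, not the paper's implementation: `toy_joint_denoiser` is a hypothetical stand-in for a learned network, the tensor shapes are arbitrary, and the linear update rule is a simple deterministic schedule chosen for readability. It shows only the structural idea that one sampler denoises video and actions together, and that dropping the image (and hand-state) condition switches it from policy mode to text-only data generation.

```python
import numpy as np

def toy_joint_denoiser(video, actions, t, text, image=None, hand_state=None):
    """Hypothetical stand-in for a learned joint denoiser.

    Returns a prediction of the clean video latents and clean action
    trajectory. A real model would be a neural network; this toy simply
    maps the available conditions to fixed targets.
    """
    video_pred = np.full_like(video, text.mean())
    if image is not None:
        video_pred = video_pred + image          # broadcast over frames
    actions_pred = np.zeros_like(actions)
    if hand_state is not None:
        actions_pred = actions_pred + hand_state  # broadcast over frames
    return video_pred, actions_pred

def sample_rollout(text, image=None, hand_state=None,
                   frames=8, video_dim=4, action_dim=6, steps=10, seed=0):
    """Jointly denoise video latents and an action trajectory from noise.

    All dimensions are illustrative assumptions. Passing image=None and
    hand_state=None corresponds to the text-only regime.
    """
    rng = np.random.default_rng(seed)
    video = rng.standard_normal((frames, video_dim))
    actions = rng.standard_normal((frames, action_dim))
    for i in range(steps):
        t = 1.0 - i / steps                       # noise level, 1 -> 0
        video_pred, actions_pred = toy_joint_denoiser(
            video, actions, t, text, image, hand_state)
        w = 1.0 / (steps - i)                     # reaches 1 on the last step
        video = video + w * (video_pred - video)
        actions = actions + w * (actions_pred - actions)
    return video, actions

# Policy mode: language + initial image + initial hand state.
text = np.ones(3)
policy_video, policy_actions = sample_rollout(
    text, image=np.arange(4.0), hand_state=np.full(6, 0.5))

# Data-engine mode: the same sampler with text as the only condition.
gen_video, gen_actions = sample_rollout(text)
```

Both calls share one sampler and one denoiser signature; only the set of supplied conditions changes, which mirrors the abstract's claim that a single architecture serves as both policy and data engine.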

