Roy Kapon

2papers

2 Papers

CVOct 23, 2023
MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion

Roy Kapon, Guy Tevet, Daniel Cohen-Or et al.

We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation, using 2D diffusion models that were trained on motions obtained from in-the-wild videos. As such, MAS opens opportunities to exciting and diverse fields of motion previously under-explored as 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence, and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastic performances featuring a ball apparatus, and horse races. In each of these domains, 3D motion capture is arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach, which optimizes each sample by repeatedly applying small fixes, our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate, MAS avoids common issues such as out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/

CVMar 2, 2023
Human Motion Diffusion as a Generative Prior

Yonatan Shafir, Guy Tevet, Roy Kapon et al.

Recent work has demonstrated the significant potential of denoising diffusion models for generating human motion, including text-to-motion capabilities. However, these methods are restricted by the paucity of annotated motion data, a focus on single-person motions, and a lack of detailed control. In this paper, we introduce three forms of composition based on diffusion priors: sequential, parallel, and model composition. Using sequential composition, we tackle the challenge of long sequence generation. We introduce DoubleTake, an inference-time method with which we generate long animations consisting of sequences of prompted intervals and their transitions, using a prior trained only for short clips. Using parallel composition, we show promising steps toward two-person generation. Beginning with two fixed priors as well as a few two-person training examples, we learn a slim communication block, ComMDM, to coordinate interaction between the two resulting motions. Lastly, using model composition, we first train individual priors to complete motions that realize a prescribed motion for a given joint. We then introduce DiffusionBlending, an interpolation mechanism to effectively blend several such models to enable flexible and efficient fine-grained joint and trajectory-level control and editing. We evaluate the composition methods using an off-the-shelf motion diffusion model, and further compare the results to dedicated models trained for these specific tasks.