CV
May 29, 2025

Generating Fit Check Videos with a Handheld Camera

UW
arXiv:2505.23886v1
h-index: 33
Originality: Incremental advance
AI Analysis

The method offers consumers a more convenient way to create full-body videos without mounted cameras or careful framing, though it appears to be an incremental advance over existing video generation approaches.

The paper tackles the problem of generating full-body fit check videos from just two static photos and an IMU motion reference, enabling convenient capture with handheld mobile devices. The proposed video diffusion model combines a multi-reference attention mechanism with an image-based fine-tuning strategy to achieve realistic human-scene composition with consistent illumination and shadows.

Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy, as well as a multi-reference attention mechanism, which together effectively integrate appearance information from both the front and back selfies into the video diffusion model. Additionally, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections, achieving a more realistic human-scene composition.
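
The abstract does not spell out how the multi-reference attention mechanism works. As a rough illustration only, one common way to condition on several reference images is to let each video-frame token attend over its own frame's tokens together with the tokens of every reference. The sketch below assumes that design, uses identity projections instead of learned query/key/value weights, and all names in it (`multi_reference_attention`, `frame_tokens`, `front_tokens`, `back_tokens`) are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_reference_attention(frame_tokens, front_tokens, back_tokens):
    """Attend from frame tokens over frame + front + back tokens.

    ASSUMPTION: this is a generic concatenated-context attention, not the
    paper's actual layer. Tokens serve directly as queries, keys, and values
    (no learned projections) to keep the sketch short.
    """
    # Shared context: the frame's own tokens plus both reference token sets.
    context = frame_tokens + front_tokens + back_tokens
    scale = 1.0 / math.sqrt(len(frame_tokens[0]))

    outputs, weights = [], []
    for query in frame_tokens:
        # Scaled dot-product scores against every context token.
        w = softmax([dot(query, key) * scale for key in context])
        # Weighted sum of context tokens, one feature dimension at a time.
        out = [
            sum(wi * value[d] for wi, value in zip(w, context))
            for d in range(len(query))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


if __name__ == "__main__":
    frame = [[1.0, 0.0], [0.0, 1.0]]   # two toy frame tokens
    front = [[2.0, 0.0]]               # toy token from the front photo
    back = [[0.0, 2.0]]                # toy token from the back photo
    outputs, weights = multi_reference_attention(frame, front, back)
    for row in weights:
        print([round(x, 3) for x in row])
```

In this toy setup the first frame token puts most of its weight on the front-photo token and the second on the back-photo token, which is the behaviour one would want when different frames show different sides of the person.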

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
