CV, Dec 17, 2020

Human Mesh Recovery from Multiple Shots

arXiv:2012.09843v1, 67 citations
AI Analysis

This work provides a method for robust 3D human mesh recovery from edited media, which is a challenging and under-explored data source for computer vision researchers and practitioners. This is an incremental improvement to existing methods.

This paper tackles the problem of 3D human mesh recovery from edited media like movies, which suffer from abrupt shot changes and truncation. The authors propose a multi-shot optimization framework that leverages the smooth 3D scene structure across shot changes to improve 3D reconstruction and generate pseudo ground truth 3D human meshes for long sequences. This data improves the robustness of single-image human mesh recovery models and enables a transformer-based temporal encoder for video that handles missing observations due to shot changes.

Videos from edited media like movies are a useful, yet under-explored source of information. The rich variety of appearance and interactions between humans depicted over a large temporal context in these films could be a valuable source of data. However, the richness of data comes at the expense of fundamental challenges such as abrupt shot changes and close-up shots of actors with heavy truncation, which limit the applicability of existing human 3D understanding methods. In this paper, we address these limitations with an insight that while shot changes of the same scene incur a discontinuity between frames, the 3D structure of the scene still changes smoothly. This allows us to handle frames before and after the shot change as a multi-view signal that provides strong cues to recover the 3D state of the actors. We propose a multi-shot optimization framework, which leads to improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh. We show that the resulting data is beneficial in the training of various human mesh recovery models: for single image, we achieve improved robustness; for video, we propose a pure transformer-based temporal encoder, which can naturally handle missing observations due to shot changes in the input frames. We demonstrate the importance of the insight and proposed models through extensive experiments. The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media, which could be helpful for many downstream applications. Project page: https://geopavlakos.github.io/multishot
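
The abstract does not say how the temporal encoder handles missing observations. As a rough illustration only, one common way for an attention-based encoder to tolerate missing frames is to mask them out as attention keys. The sketch below assumes single-head attention without learned projections, and it assumes missing frames are zero-filled; the function name `masked_self_attention` and the `valid` flag are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def masked_self_attention(x, valid):
    """Single-head self-attention over T frames (illustrative assumption,
    not the paper's architecture).

    x:     (T, D) finite per-frame features; missing frames zero-filled.
    valid: (T,) bool, False where a frame has no observation.
    """
    assert valid.any(), "need at least one observed frame"
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)                       # (T, T) similarities
    scores = np.where(valid[None, :], scores, -np.inf)  # hide missing keys
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ x, weights

# Five frames, the third one missing (for example, dropped at a shot change).
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))
valid = np.array([True, True, False, True, True])
frames[~valid] = 0.0
out, attn = masked_self_attention(frames, valid)
```

In this sketch the missing frame still receives an output, but that output is built only from the observed frames, since no frame can attend to the masked one.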

Code Implementations: 1 repo