CV · Dec 9, 2020

Human Detection and Segmentation via Multi-view Consensus

arXiv:2012.05119v2 · 2 citations
AI Analysis

This work targets human detection and segmentation in scenes with dynamic activities and camera motion, where annotated training data is scarce. It is aimed at computer vision researchers and practitioners.

The paper tackles self-supervised detection and segmentation of foreground objects, specifically humans, without annotated training data. The authors propose a multi-camera framework that enforces geometric constraints through multi-view consistency during training; at inference time the method needs only a single RGB image. It outperforms state-of-the-art techniques both on images that differ visually from standard benchmarks and on the Human3.6M dataset.

Self-supervised detection and segmentation of foreground objects aims for accuracy without annotated training data. However, existing approaches predominantly rely on restrictive assumptions on appearance and motion. For scenes with dynamic activities and camera motion, we propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training via coarse 3D localization in a voxel grid and fine-grained offset regression. In this manner, we learn a joint distribution of proposals over multiple views. At inference time, our method operates on single RGB images. We outperform state-of-the-art techniques both on images that visually depart from those of standard benchmarks and on those of the classical Human3.6M dataset.
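The coarse-to-fine idea in the abstract (pick a cell in a voxel grid, refine it with a regressed offset, then relate the result to every view) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the offset convention (fractions of one voxel), and the 3x4 pinhole camera matrices are all assumptions made for illustration.

```python
import numpy as np

def voxel_centers(grid_min, voxel_size, shape):
    """Return an (X, Y, Z, 3) array holding the 3D center of every voxel."""
    axes = [np.arange(n) for n in shape]
    idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return np.asarray(grid_min, dtype=float) + (idx + 0.5) * voxel_size

def localize(voxel_scores, offsets, grid_min, voxel_size):
    """Coarse step: take the highest-scoring voxel.
    Fine step: shift its center by that voxel's offset.
    Assumption: offsets are expressed in units of one voxel."""
    best = np.unravel_index(np.argmax(voxel_scores), voxel_scores.shape)
    center = voxel_centers(grid_min, voxel_size, voxel_scores.shape)[best]
    return center + offsets[best] * voxel_size

def project(P, point_3d):
    """Project a 3D point with a 3x4 camera matrix; returns pixel (u, v)."""
    x = P @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

# Hypothetical toy setup: a 4x4x4 grid and two cameras sharing intrinsics K.
scores = np.zeros((4, 4, 4))
scores[1, 2, 3] = 5.0
offsets = np.zeros((4, 4, 4, 3))
offsets[1, 2, 3] = [0.25, -0.25, 0.0]

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])            # reference camera
P2 = K @ np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])  # shifted camera

point = localize(scores, offsets, grid_min=[-2.0, -2.0, 0.0], voxel_size=1.0)
centers_2d = [project(P, point) for P in (P1, P2)]
```

Because both 2D centers come from the same 3D point, the proposals in the two views agree geometrically by construction, which is the intuition behind a multi-view consistency constraint. How the paper actually parameterizes proposals and losses is not stated in the abstract, so this sketch only covers the geometry.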

Code Implementations: 1 repo