Visual Compiler: Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator
This provides a solution for bootstrapping human detection and pose estimation in scenarios where real annotated data is limited, which is incremental as it builds on existing computer graphics and neural network techniques.
The paper tackles the problem of pedestrian detection and pose estimation in scenes with scarce or no real annotated data by introducing a Visual Compiler that synthesizes a scene-specific detector and estimator using computer graphics renders from a single image and auxiliary scene information. The result shows that this approach outperforms off-the-shelf state-of-the-art methods trained on real data.
We introduce the concept of a Visual Compiler that generates a scene specific pedestrian detector and pose estimator without any pedestrian observations. Given a single image and auxiliary scene information in the form of camera parameters and geometric layout of the scene, the Visual Compiler first infers geometrically and photometrically accurate images of humans in that scene through the use of computer graphics rendering. Using these renders we learn a scene-and-region specific spatially-varying fully convolutional neural network, for simultaneous detection, pose estimation and segmentation of pedestrians. We demonstrate that when real human annotated data is scarce or non-existent, our data generation strategy can provide an excellent solution for bootstrapping human detection and pose estimation. Experimental results show that our approach outperforms off-the-shelf state-of-the-art pedestrian detectors and pose estimators that are trained on real data.