End-to-end people detection in crowded scenes
This work addresses accurate people detection in crowded scenes, proposing an approach that predicts all detections jointly rather than scanning sliding windows or classifying a fixed set of proposals.
The paper proposes an end-to-end model that decodes an image directly into a set of distinct detection hypotheses, which removes the need for post-processing such as non-maximum suppression. It demonstrates the approach on the challenging task of detecting people in crowded scenes.
Current people detectors operate either by scanning an image in a sliding window fashion or by classifying a discrete set of proposals. We propose a model that is based on decoding an image into a set of people detections. Our system takes an image as input and directly outputs a set of distinct detection hypotheses. Because we generate predictions jointly, common post-processing steps such as non-maximum suppression are unnecessary. We use a recurrent LSTM layer for sequence generation and train our model end-to-end with a new loss function that operates on sets of detections. We demonstrate the effectiveness of our approach on the challenging task of detecting people in crowded scenes.
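The abstract does not spell out the loss that operates on sets of detections, so the following is only a minimal sketch of one way such a loss could look. It assumes that each ground-truth box is matched to a distinct prediction by minimum total L1 distance, that matched pairs pay a localization cost, and that every prediction pays a cross-entropy cost on its confidence. The function names, the `(x, y, w, h)` box format, the `alpha` weight, and the brute-force matching are all illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as (x, y, w, h) tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(pred_boxes, gt_boxes):
    """Assign each ground-truth box to a distinct prediction.

    Returns a tuple m where m[i] is the index of the prediction matched to
    ground-truth box i. Brute force over all assignments, so only suitable
    for small sets; a real system would use a polynomial-time assignment
    solver instead (assumption: matching cost is plain L1 distance).
    """
    if len(pred_boxes) < len(gt_boxes):
        raise ValueError("need at least as many predictions as ground-truth boxes")
    return min(
        permutations(range(len(pred_boxes)), len(gt_boxes)),
        key=lambda perm: sum(l1(pred_boxes[j], gt_boxes[i]) for i, j in enumerate(perm)),
    )


def set_loss(pred_boxes, pred_conf, gt_boxes, alpha=1.0):
    """Hypothetical set-based detection loss.

    Localization term: L1 distance over matched (prediction, ground truth) pairs.
    Confidence term: binary cross-entropy, target 1 for matched predictions
    and 0 for unmatched ones. `alpha` (an assumed parameter) weights localization.
    """
    assignment = match(pred_boxes, gt_boxes)
    matched = set(assignment)
    loc = sum(l1(pred_boxes[j], gt_boxes[i]) for i, j in enumerate(assignment))
    eps = 1e-9  # guards against log(0)
    conf = 0.0
    for j, c in enumerate(pred_conf):
        target = 1.0 if j in matched else 0.0
        conf -= target * math.log(c + eps) + (1.0 - target) * math.log(1.0 - c + eps)
    return alpha * loc + conf, assignment
```

Because the matching is recomputed for every example, the loss value does not depend on the order in which predictions or ground-truth boxes are listed, which is the property a loss over sets needs. Calling `set_loss(pred_boxes, pred_conf, gt_boxes)` returns both the scalar loss and the chosen assignment, so the matching can be inspected.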