YOLO -- You only look 10647 times
This work offers a clearer understanding of YOLO's mechanism for researchers in computer vision, though it is incremental as it reframes an existing method without introducing new performance gains.
The paper reinterprets YOLO as a parallel classification of 10647 fixed region proposals, bridging the conceptual gap between single-stage, two-stage, and classification models, and provides interactive tools for visualizing YOLO's processing streams.
In this work, we explain the "You Only Look Once" (YOLO) single-stage object detection approach as a parallel classification of 10647 fixed region proposals. We support this view by showing that each of YOLO's output pixels is attentive to a specific sub-region of previous layers, comparable to a local region proposal. This understanding reduces the conceptual gap between YOLO-like single-stage object detection models, RCNN-like two-stage region-proposal-based models, and ResNet-like image classification models. In addition, we created interactive exploration tools for a better visual understanding of YOLO's information processing streams: https://limchr.github.io/yolo_visualization
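The abstract does not say where the number 10647 comes from. One configuration that reproduces it, stated here as an assumption rather than something the text confirms, is a YOLOv3-style setup: a 416x416 input, three output scales with strides 32, 16, and 8, and three anchor boxes per output pixel. The function name and parameters below are hypothetical and only illustrate the arithmetic.

```python
def count_region_proposals(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Count fixed region proposals over all output scales.

    Assumed YOLOv3-style defaults (not stated in the abstract):
    a square input, one output grid per stride, and a fixed
    number of anchor boxes per output pixel.
    """
    total = 0
    for stride in strides:
        grid = input_size // stride               # 13, 26, 52 for the defaults
        total += grid * grid * anchors_per_cell   # proposals at this scale
    return total


if __name__ == "__main__":
    # 3 * (13*13 + 26*26 + 52*52) = 3 * 3549 = 10647
    print(count_region_proposals())
```

Under these assumptions, each of the 3549 output pixels contributes three fixed proposals, which matches the count in the title. A different input resolution or anchor count would give a different total.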