YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges
This is an incremental survey paper that synthesizes existing research for practitioners and researchers in computer vision.
This paper provides a comprehensive survey of the YOLO family of models, tracing their evolution from YOLOv1 to YOLOv11 and highlighting how they have advanced real-time object detection through architectural and algorithmic improvements. It reviews performance benchmarks and extended capabilities like instance segmentation and domain-specific applications.
Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real-time vision applications through unified, end-to-end detection frameworks. From YOLOv1's pioneering regression-based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements.. Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain-specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.