CV, RO · Jun 25, 2025

Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos

Yitong Quan, Benjamin Kiefer, Martin Messmer, Andreas Zell

arXiv:2506.20550v1 · 10.25 citations · h-index: 9 · EMCR

Originality: Incremental advance

AI Analysis

The work addresses transient challenges such as motion blur and occlusions in practical applications such as surveillance and autonomous driving. The advance is incremental, since it builds on existing YOLO architectures.

The paper tackles the problem of video object detection by proposing a lightweight multi-frame integration strategy that stacks consecutive frames as input to YOLO-based detectors while supervising only a single target frame, improving detection robustness with minimal architectural changes. Experiments on MOT20Det and a new BOAT360 dataset show the method effectively narrows performance gaps between compact and heavy networks.

Modern image-based object detection models, such as YOLOv7, primarily process individual frames independently, thus ignoring valuable temporal context naturally present in videos. Meanwhile, existing video-based detection methods often introduce complex temporal modules, significantly increasing model size and computational complexity. In practical applications such as surveillance and autonomous driving, transient challenges including motion blur, occlusions, and abrupt appearance changes can severely degrade single-frame detection performance. To address these issues, we propose a straightforward yet highly effective strategy: stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame. This approach leverages temporal information with minimal modifications to existing architectures, preserving simplicity, computational efficiency, and real-time inference capability. Extensive experiments on the challenging MOT20Det and our BOAT360 datasets demonstrate that our method improves detection robustness, especially for lightweight models, effectively narrowing the gap between compact and heavy detection networks. Additionally, we contribute the BOAT360 benchmark dataset, comprising annotated fisheye video sequences captured from a boat, to support future research in multi-frame video object detection in challenging real-world scenarios.
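The input construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made here for illustration: frames are concatenated along the channel axis, the target frame is the last one in the window, and the start of a video is padded by repeating the first frame. The function name `build_stacked_sample` and the label layout are hypothetical.

```python
import numpy as np


def build_stacked_sample(frames, labels, t, num_frames=3):
    """Return (stacked_input, target_labels) for target frame index t.

    frames: list of H x W x 3 arrays, one per video frame.
    labels: list of per-frame annotation arrays (layout is hypothetical).
    Assumption: the window is the num_frames frames ending at t.
    """
    if not 0 <= t < len(frames):
        raise IndexError("target index out of range")
    # Window indices in temporal order; clamp to 0 so the first frames
    # of a video are padded by repeating frame 0 (an assumption).
    idx = [max(t - k, 0) for k in range(num_frames - 1, -1, -1)]
    # Assumption: stack along the channel axis, giving H x W x (3 * num_frames).
    stacked = np.concatenate([frames[i] for i in idx], axis=-1)
    # Only the target frame's annotations supervise the detector output.
    return stacked, labels[t]


# Toy video: 5 frames of size 4 x 6, each filled with its own index.
video = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(5)]
# One dummy box per frame: [class, cx, cy, w, h] (hypothetical layout).
boxes = [np.array([[0, 0.5, 0.5, 0.1, 0.1 * (i + 1)]]) for i in range(5)]

x, y = build_stacked_sample(video, boxes, t=4, num_frames=3)
x0, y0 = build_stacked_sample(video, boxes, t=0, num_frames=3)
```

Under these assumptions, the only change a detector would need is a first convolution that accepts `3 * num_frames` input channels; the rest of the network and the loss stay as in the single-frame case.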
