CV · RO · Jun 25, 2025

Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos

arXiv:2506.20550v15 citations · h-index: 9 · EMCR
Originality: Incremental advance
AI Analysis

The method addresses transient challenges such as motion blur and occlusion in practical applications like surveillance and autonomous driving, though the contribution is incremental because it builds on existing YOLO architectures.

The paper tackles the problem of video object detection by proposing a lightweight multi-frame integration strategy that stacks consecutive frames as input to YOLO-based detectors while supervising only a single target frame, improving detection robustness with minimal architectural changes. Experiments on MOT20Det and a new BOAT360 dataset show the method effectively narrows performance gaps between compact and heavy networks.

Modern image-based object detection models, such as YOLOv7, primarily process individual frames independently, thus ignoring valuable temporal context naturally present in videos. Meanwhile, existing video-based detection methods often introduce complex temporal modules, significantly increasing model size and computational complexity. In practical applications such as surveillance and autonomous driving, transient challenges including motion blur, occlusions, and abrupt appearance changes can severely degrade single-frame detection performance. To address these issues, we propose a straightforward yet highly effective strategy: stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame. This approach leverages temporal information with minimal modifications to existing architectures, preserving simplicity, computational efficiency, and real-time inference capability. Extensive experiments on the challenging MOT20Det and our BOAT360 datasets demonstrate that our method improves detection robustness, especially for lightweight models, effectively narrowing the gap between compact and heavy detection networks. Additionally, we contribute the BOAT360 benchmark dataset, comprising annotated fisheye video sequences captured from a boat, to support future research in multi-frame video object detection in challenging real-world scenarios.
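The input construction described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions that the abstract does not state: frames are concatenated along the channel axis, the target frame is the last frame in the window, and windows that would start before the sequence are padded by repeating the first frame. The names `window_indices`, `stack_frames`, and `make_sample` are hypothetical.

```python
from typing import Any, List, Sequence, Tuple

# A frame is represented as H x W x C nested lists (illustrative only;
# a real pipeline would use tensors).
Frame = List[List[List[float]]]


def window_indices(target: int, num_frames: int, stride: int = 1) -> List[int]:
    """Indices of the frames stacked for one sample, oldest first.

    Assumption: the target frame closes the window, and indices that would
    fall before the start of the sequence are clamped to 0.
    """
    return [max(0, target - k * stride) for k in range(num_frames - 1, -1, -1)]


def stack_frames(frames: Sequence[Frame]) -> Frame:
    """Concatenate frames along the channel axis (assumed stacking scheme).

    N frames with C channels each become one input with N * C channels.
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [[value for frame in frames for value in frame[y][x]] for x in range(width)]
        for y in range(height)
    ]


def make_sample(
    video: Sequence[Frame],
    labels: Sequence[Any],
    target: int,
    num_frames: int = 3,
) -> Tuple[Frame, Any]:
    """Build one training pair: stacked input, annotations of the target frame only."""
    indices = window_indices(target, num_frames)
    stacked = stack_frames([video[i] for i in indices])
    # Only the target frame's annotations supervise the detector; the
    # other frames in the window contribute context but no labels.
    return stacked, labels[target]
```

Under these assumptions, the only architectural change on the detector side is the number of input channels of the first convolution (for example, 9 instead of 3 when stacking three RGB frames), which is consistent with the abstract's claim of minimal modifications.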
