Efficient Bayer-Domain Video Computer Vision with Fast Motion Estimation and Learned Perception Residual
This work addresses efficiency challenges in video computer vision systems with a domain-specific solution that is incremental in approach.
The paper tackles the computational burden of video computer vision by proposing a framework that removes the image signal processor (ISP) and operates directly on Bayer raw data, combined with fast motion estimation and perception residual networks. It reports substantial acceleration with only minor performance degradation.
Video computer vision systems face substantial computational burdens arising from two fundamental challenges: eliminating unnecessary processing and reducing temporal redundancy in back-end inference while maintaining accuracy with minimal extra computation. To address these issues, we propose an efficient video computer vision framework that jointly optimizes both the front end and back end of the pipeline. On the front end, we remove the traditional image signal processor (ISP) and feed Bayer raw measurements directly into Bayer-domain vision models, avoiding costly human-oriented ISP operations. On the back end, we introduce a fast and highly parallel motion estimation algorithm that extracts inter-frame temporal correspondence to avoid redundant computation. To mitigate artifacts caused by motion inaccuracies, we further employ lightweight perception residual networks that directly learn perception-level residuals and refine the propagated features. Experiments across multiple models and tasks demonstrate that our system achieves substantial acceleration with only minor performance degradation.
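To make the pipeline described above more concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The abstract does not specify the motion estimation algorithm, the Bayer layout, or the residual network design, so this sketch substitutes a plain exhaustive block-matching search with a sum-of-absolute-differences cost, assumes an RGGB mosaic, and represents the residual network as a caller-supplied function. All function names and parameters (`pack_bayer`, `block_motion`, `propagate`, `refine`, `block`, `radius`, `residual_fn`) are hypothetical.

```python
import numpy as np

def pack_bayer(raw):
    """Pack an (H, W) Bayer mosaic into (H/2, W/2, 4) planes.
    Assumes an RGGB layout; real sensors may use a different order."""
    return np.stack([raw[0::2, 0::2], raw[0::2, 1::2],
                     raw[1::2, 0::2], raw[1::2, 1::2]], axis=-1)

def block_motion(prev, curr, block=8, radius=2):
    """Exhaustive block matching (a stand-in for the paper's fast algorithm).
    Returns (H//block, W//block, 2) integer offsets (dy, dx) such that
    curr[y, x] is best matched by prev[y + dy, x + dx] within each block."""
    h, w = curr.shape
    by, bx = h // block, w // block
    best_cost = np.full((by, bx), np.inf)
    motion = np.zeros((by, bx, 2), dtype=np.int64)
    pad = np.pad(prev, radius, mode="edge")
    # Every candidate offset is evaluated for all blocks at once,
    # so the inner work is fully vectorized across the frame.
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy:radius + dy + h,
                          radius + dx:radius + dx + w]
            diff = np.abs(curr - shifted)[:by * block, :bx * block]
            cost = diff.reshape(by, block, bx, block).sum(axis=(1, 3))
            better = cost < best_cost
            best_cost[better] = cost[better]
            motion[better] = (dy, dx)
    return motion

def propagate(feat, motion, block=8):
    """Warp key-frame features (H, W, C) to the current frame using
    per-block motion instead of recomputing them from scratch."""
    h, w = feat.shape[:2]
    by, bx = motion.shape[:2]
    dense = np.repeat(np.repeat(motion, block, axis=0), block, axis=1)
    ys, xs = np.mgrid[0:by * block, 0:bx * block]
    src_y = np.clip(ys + dense[..., 0], 0, h - 1)
    src_x = np.clip(xs + dense[..., 1], 0, w - 1)
    return feat[src_y, src_x]

def refine(warped, residual_fn):
    """Add a learned correction to the propagated features.
    residual_fn is a placeholder for a lightweight residual network."""
    return warped + residual_fn(warped)

# Usage: a synthetic frame shifted by (1, 2) pixels.
rng = np.random.default_rng(0)
prev = rng.random((32, 32))
curr = np.roll(prev, shift=(1, 2), axis=(0, 1))
mv = block_motion(prev, curr)
warped = propagate(prev[..., None], mv)
refined = refine(warped, lambda f: np.zeros_like(f))  # zero residual as a stub
```

In this toy setting the expensive backbone would run only on key frames, while intermediate frames reuse warped features and pay only for the motion search and the small residual correction. The sketch operates on a single-channel frame at full resolution for clarity; a real system would likely estimate motion on packed Bayer planes and warp lower-resolution feature maps.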