CV · May 22, 2025

V2V: Scaling Event-Based Vision through Efficient Video-to-Voxel Simulation

Hanyue Lou, Jinxiu Liang, Minggui Teng, Yi Wang, Boxin Shi

arXiv:2505.16797v1 · 2 citations · h-index: 13 · Has Code

Originality: Incremental advance

AI Analysis

V2V addresses the data scarcity and storage inefficiency that limit event-based vision models, enabling larger-scale training and better generalization in applications such as robotics and autonomous systems.

The paper tackles the problem of scaling event-based vision training datasets by introducing Video-to-Voxel (V2V), which converts video frames directly into event-based voxel grids. This yields a 150-fold reduction in storage requirements and enables training on 10,000 videos totaling 52 hours, leading to substantial improvements in video reconstruction and optical flow estimation.

Event-based cameras offer unique advantages such as high temporal resolution, high dynamic range, and low power consumption. However, the massive storage requirements and I/O burdens of existing synthetic data generation pipelines and the scarcity of real data prevent event-based training datasets from scaling up, limiting the development and generalization capabilities of event vision models. To address this challenge, we introduce Video-to-Voxel (V2V), an approach that directly converts conventional video frames into event-based voxel grid representations, bypassing the storage-intensive event stream generation entirely. V2V enables a 150 times reduction in storage requirements while supporting on-the-fly parameter randomization for enhanced model robustness. Leveraging this efficiency, we train several video reconstruction and optical flow estimation model architectures on 10,000 diverse videos totaling 52 hours--an order of magnitude larger than existing event datasets, yielding substantial improvements.
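To make the frame-to-voxel idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a simplified event model in which the signed event count at a pixel between two consecutive frames is the log-intensity change divided by a contrast threshold, and it spreads each interval's count over the two nearest temporal bins with linear weights. The function name `frames_to_voxel` and the parameters `num_bins`, `threshold`, and `eps` are hypothetical choices for illustration.

```python
import numpy as np


def frames_to_voxel(frames, num_bins=5, threshold=0.2, eps=1e-3):
    """Approximate an event voxel grid directly from video frames.

    Assumption (not taken from the paper): the signed event count per pixel
    between consecutive frames is (log-intensity change) / threshold.

    frames: array of shape (T, H, W) with non-negative intensities, T >= 2.
    Returns an array of shape (num_bins, H, W).
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("frames must have shape (T, H, W) with T >= 2")

    log_frames = np.log(frames + eps)
    # Signed approximate event counts for each frame interval: (T-1, H, W).
    counts = np.diff(log_frames, axis=0) / threshold
    num_intervals = counts.shape[0]

    voxel = np.zeros((num_bins,) + frames.shape[1:], dtype=np.float64)
    for i in range(num_intervals):
        # Place the interval's midpoint on the bin axis [0, num_bins - 1].
        t = (i + 0.5) / num_intervals * (num_bins - 1)
        lower = int(np.floor(t))
        upper = min(lower + 1, num_bins - 1)
        w_upper = t - lower
        # Linear (bilinear-in-time) split between the two neighboring bins.
        voxel[lower] += (1.0 - w_upper) * counts[i]
        voxel[upper] += w_upper * counts[i]
    return voxel
```

Because no event stream is ever written out, only the source video has to be stored, and a parameter such as `threshold` can be resampled on each call (for example with `np.random.default_rng().uniform(0.1, 0.5)`, an arbitrary range chosen here) to mimic the on-the-fly randomization the abstract describes.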
