CV, LG | Jul 29, 2025

TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras

arXiv:2508.00913v1 | 4 citations | h-index: 12
Originality: Highly original
AI Analysis

This work addresses the need for effective pre-training methods for event-based perception, which is crucial for applications in robotics and autonomous systems. Its contribution is an incremental improvement over existing self-supervised learning approaches.

The paper tackles the problem of self-supervised pre-training for event cameras by introducing TESPEC, a framework that leverages long event sequences to learn spatio-temporal information, achieving state-of-the-art results in downstream tasks like object detection, semantic segmentation, and monocular depth estimation.

Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. TESPEC is well-suited for recurrent models, as it is the first framework to leverage long event sequences during pre-training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about the long-term history of events. Extensive experiments demonstrate state-of-the-art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.
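To make the idea of an accumulated reconstruction target more concrete, here is a minimal sketch in Python. It is not TESPEC's accumulation method, which the abstract does not specify. It assumes a simple leaky integrator over signed event polarities, and the function names `accumulate_events` and `random_patch_mask`, the `decay` parameter, and the `(x, y, t, p)` event layout are all hypothetical choices made for illustration.

```python
import random


def accumulate_events(events, height, width, num_frames, decay=0.9):
    """Integrate events into a video of leaky per-pixel states.

    events: list of (x, y, t, p) tuples with polarity p in {-1, +1}.
    NOTE: a leaky integrator is an assumption for illustration only,
    not the accumulation scheme proposed in the paper.
    """
    if not events:
        return [[[0.0] * width for _ in range(height)] for _ in range(num_frames)]

    t0 = min(e[2] for e in events)
    t1 = max(e[2] for e in events)
    span = max(t1 - t0, 1e-9)  # avoid division by zero for a single timestamp

    # Assign every event to one of num_frames equal-length time bins.
    binned = [[] for _ in range(num_frames)]
    for x, y, t, p in events:
        k = min(int((t - t0) / span * num_frames), num_frames - 1)
        binned[k].append((x, y, p))

    video = []
    state = [[0.0] * width for _ in range(height)]
    for k in range(num_frames):
        # Older evidence fades, so each frame summarizes the event history.
        state = [[v * decay for v in row] for row in state]
        for x, y, p in binned[k]:
            state[y][x] += p
        video.append([row[:] for row in state])
    return video


def random_patch_mask(height, width, patch, ratio, rng):
    """Return a boolean grid where True marks pixels in masked patches."""
    gh, gw = height // patch, width // patch
    masked = set(rng.sample(range(gh * gw), round(ratio * gh * gw)))
    return [
        [((y // patch) * gw + (x // patch)) in masked for x in range(gw * patch)]
        for y in range(gh * patch)
    ]
```

In a masked-modeling setup of this kind, the model would see only the unmasked input patches and be trained to predict the accumulated frames. Because each frame depends on earlier bins through the decayed state, predicting it rewards a model that carries information across time.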

