CV · Mar 17

SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks

Maxime Vaillant, Axel Carlier, Lai Xing Ng, Christophe Hurter, Benoit R. Cottereau

arXiv:2603.16338 · 15.8 · h-index: 12

AI Analysis

This work addresses label scarcity in event-based vision, which could enable more efficient deployment on neuromorphic hardware. The contribution is incremental, since it adapts existing methods to the spiking domain.

The paper tackles the problem of training Spiking Neural Networks (SNNs) for event-based vision with limited labeled data. It introduces SpikeCLR, a contrastive self-supervised learning framework, and reports that self-supervised pretraining followed by fine-tuning outperforms supervised learning in few-shot settings, with consistent gains on benchmarks such as CIFAR10-DVS and DVS-Gesture.

Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.
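The two ingredients named in the abstract, event-specific augmentations and a contrastive objective, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption for illustration rather than the authors' implementation: the `(x, y, t, p)` event layout, the function names `augment_events` and `nt_xent_loss`, the parameter defaults, and the use of the NT-Xent loss from SimCLR-style training (the abstract does not state which contrastive objective SpikeCLR uses). Plain arrays stand in for SNN embeddings, and surrogate gradient training is not shown.

```python
import numpy as np


def augment_events(events, rng, width=128, height=128,
                   max_shift=8, max_jitter=1000, p_flip=0.5):
    """Return one randomly augmented view of an event stream.

    events: integer array of shape (N, 4) with columns x, y, t, p,
    where p is the polarity in {0, 1}. This layout is an assumption.
    """
    x, y, t, p = (events[:, i].copy() for i in range(4))

    # Spatial: shift all events by one random offset.
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    x, y = x + dx, y + dy

    # Temporal: jitter each timestamp independently, clipped at zero.
    t = np.maximum(t + rng.integers(-max_jitter, max_jitter + 1, size=t.shape), 0)

    # Polarity: flip the sign of every event with probability p_flip.
    if rng.random() < p_flip:
        p = 1 - p

    # Drop events shifted outside the sensor, then restore time order.
    keep = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    out = np.stack([x, y, t, p], axis=1)[keep]
    return out[np.argsort(out[:, 2], kind="stable")]


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss for two batches of embeddings.

    z1[i] and z2[i] are embeddings of two views of the same sample;
    every other embedding in the combined batch acts as a negative.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity

    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity

    # Index of the positive partner for each of the 2n rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Numerically stable log-sum-exp over each row.
    m = sim.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * n), pos]))
```

In a pretraining loop under these assumptions, each unlabeled recording would be passed through `augment_events` twice, both views would be encoded by the network, and `nt_xent_loss` would be minimized over the resulting embedding pairs. The loss is lower when the two views of a sample map to nearby embeddings than when they map to unrelated ones.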
