CV · Feb 10, 2022

Exploiting Spatial Sparsity for Event Cameras with Visual Transformers

arXiv:2202.05054v1 · 41 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges for event camera applications, but it is incremental as it fine-tunes an existing ViT model with patch selection.

The paper tackles the problem of efficiently processing spatially sparse event camera data. It proposes a visual transformer (ViT) architecture that keeps only active patches, reducing the average number of patches by at least 50% with only a 0.34% drop in classification accuracy on the N-Caltech101 dataset.

Event cameras report local changes of brightness through an asynchronous stream of output events. Events are spatially sparse at pixel locations with little brightness variation. We propose using a visual transformer (ViT) architecture to leverage its ability to process a variable-length input. The input to the ViT consists of events that are accumulated into time bins and spatially separated into non-overlapping sub-regions called patches. Patches are selected when the number of nonzero pixel locations within a sub-region is above a threshold. We show that by fine-tuning a ViT model on the selected active patches, we can reduce the average number of patches fed into the backbone during inference by at least 50% with only a minor drop (0.34%) in classification accuracy on the N-Caltech101 dataset. This reduction translates into a decrease of 51% in Multiply-Accumulate (MAC) operations and an increase of 46% in inference speed on a server CPU.
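
The patch-selection step described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `accumulate_events` and `select_active_patches`, the default patch size of 16, the threshold of 8, and the choice to ignore event polarity are all assumptions made for the example. A pixel location is treated as nonzero if it received an event in any time bin, which is one plausible reading of the abstract.

```python
import numpy as np


def accumulate_events(x, y, t, height, width, n_bins):
    """Accumulate events into a (n_bins, height, width) count tensor.

    Assumption: polarity is ignored and time bins have equal duration.
    """
    frame = np.zeros((n_bins, height, width), dtype=np.float32)
    span = max(float(t.max() - t.min()), 1e-9)  # avoid division by zero
    bins = ((t - t.min()) / span * n_bins).astype(np.int64)
    bins = np.minimum(bins, n_bins - 1)  # the latest event falls in the last bin
    np.add.at(frame, (bins, y, x), 1.0)  # unbuffered add handles repeated pixels
    return frame


def select_active_patches(frame, patch_size=16, threshold=8):
    """Split a (C, H, W) tensor into non-overlapping patches and keep active ones.

    A patch is kept when its number of nonzero pixel locations exceeds
    `threshold`. Returns the kept patches and their row-major grid indices.
    """
    c, h, w = frame.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, P, gw, P) -> (gh, gw, C, P, P) -> (N, C, P, P)
    patches = frame.reshape(c, gh, patch_size, gw, patch_size)
    patches = patches.transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(gh * gw, c, patch_size, patch_size)
    # Assumption: a location counts as nonzero if any time bin is nonzero there.
    active_pixels = (patches != 0).any(axis=1)
    counts = active_pixels.reshape(gh * gw, -1).sum(axis=1)
    keep = np.flatnonzero(counts > threshold)
    return patches[keep], keep
```

The returned indices matter because a transformer that receives only a subset of patches still needs to know where each one came from, for example to look up the matching positional embeddings. The variable number of kept patches per sample is what the variable-length input capability of the ViT accommodates.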

