CV · Nov 3, 2024

Activating Self-Attention for Multi-Scene Absolute Pose Regression

arXiv:2411.01443v2 · 4 citations · h-index: 6 · NIPS
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in camera pose estimation for real-world environments, offering incremental improvements to transformer-based models.

The paper tackles collapsed self-attention maps in transformer-based models for multi-scene absolute pose regression, a failure that limits the encoder's representation capacity. It proposes an auxiliary loss and a fixed positional encoding to activate self-attention, and reports outperforming existing methods in both outdoor and indoor scenes.

Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation across various real-world environments. Recently, transformer-based models have been devised to regress the camera pose directly across multiple scenes. Despite their potential, transformer encoders are underutilized because the self-attention map collapses, leaving it with low representation capacity. This work highlights the problem and investigates it from a new perspective: distortion of the query-key embedding space. Based on statistical analysis, we reveal that queries and keys are mapped into completely different spaces, while only a few keys are blended into the query region. This leads to the collapse of the self-attention map, as all queries are considered similar to those few keys. Therefore, we propose simple but effective solutions to activate self-attention. Concretely, we present an auxiliary loss that aligns queries and keys, preventing the distortion of the query-key space and encouraging the model to find global relations through self-attention. In addition, fixed sinusoidal positional encoding is adopted instead of an undertrained learnable one to inject appropriate positional clues into the inputs of self-attention. As a result, our approach resolves the aforementioned problem effectively, thus outperforming existing methods in both outdoor and indoor scenes.
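The two ingredients named in the abstract can be illustrated with a minimal sketch. The fixed sinusoidal encoding below follows the standard formulation from the original Transformer paper. The alignment penalty is only a hypothetical stand-in: the abstract does not give the form of the auxiliary loss, so `centroid_alignment_penalty` (an assumed name) simply measures how far apart the mean query and the mean key sit, which is one simple way to quantify the query-key separation described above.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim):
    """Fixed (non-learned) sinusoidal positional encoding.

    Returns a num_positions x dim table where even columns hold sines and
    odd columns hold cosines of position-dependent angles.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            # Wavelengths grow geometrically with the channel index.
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def centroid_alignment_penalty(queries, keys):
    """ASSUMPTION: illustrative penalty, not the paper's auxiliary loss.

    Squared Euclidean distance between the mean query and the mean key.
    It is zero when the two sets share a centroid and grows as the
    query and key regions drift apart.
    """
    q_mean, k_mean = centroid(queries), centroid(keys)
    return sum((q - k) ** 2 for q, k in zip(q_mean, k_mean))
```

In a training loop, a penalty of this kind would be added to the pose regression loss with a small weight; the weighting and the exact alignment term used by the authors would need to be taken from the paper or its released code.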

Code Implementations: 1 repo
