CV · Jun 23, 2025

Lightweight RGB-T Tracking with Mobile Vision Transformers

arXiv:2506.19154v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient object tracking in challenging conditions like low illumination, offering a practical solution for real-time applications, though it is incremental as it builds on existing multimodal tracking methods.

The paper tackles the computational cost of multimodal RGB-T tracking by proposing a lightweight algorithm based on Mobile Vision Transformers. It reports accuracy comparable to state-of-the-art efficient multimodal trackers with fewer than 4 million parameters and a GPU inference speed of 122 FPS.

Single-modality object tracking (e.g., RGB-only) encounters difficulties in challenging imaging conditions, such as low illumination and adverse weather conditions. To solve this, multimodal tracking (e.g., RGB-T models) aims to leverage complementary data such as thermal infrared features. While recent Vision Transformer-based multimodal trackers achieve strong performance, they are often computationally expensive due to large model sizes. In this work, we propose a novel lightweight RGB-T tracking algorithm based on Mobile Vision Transformers (MobileViT). Our tracker introduces a progressive fusion framework that jointly learns intra-modal and inter-modal interactions between the template and search regions using separable attention. This design produces effective feature representations that support more accurate target localization while achieving a small model size and fast inference speed. Compared to state-of-the-art efficient multimodal trackers, our model achieves comparable accuracy while offering significantly lower parameter counts (less than 4 million) and the fastest GPU inference speed of 122 frames per second. This paper is the first to propose a tracker using Mobile Vision Transformers for RGB-T tracking and multimodal tracking at large. Tracker code and model weights will be made publicly available upon acceptance.
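To make the idea of separable attention over template and search tokens more concrete, here is a minimal NumPy sketch. It follows the separable self-attention formulation from MobileViTv2 (one softmax over tokens, a single context vector, cost linear in the number of tokens). Everything else is an assumption for illustration: the abstract does not specify the fusion layout, so the function names, the token counts (64 template and 256 search tokens per modality), the embedding width of 32, and the choice to concatenate all four token sets into one sequence are hypothetical, not the authors' implementation.

```python
import numpy as np


def separable_attention(x, w_i, w_k, w_v, w_o):
    """Separable self-attention in the style of MobileViTv2.

    x:   (k, d) token matrix
    w_i: (d, 1) projection producing one context score per token
    w_k, w_v, w_o: (d, d) key, value, and output projections
    Returns a (k, d) matrix. Cost is O(k * d^2), linear in k.
    """
    scores = x @ w_i                                   # (k, 1)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over tokens
    # Single context vector: weighted sum of the key projections.
    context = (weights * (x @ w_k)).sum(axis=0, keepdims=True)  # (1, d)
    values = np.maximum(x @ w_v, 0.0)                  # ReLU, (k, d)
    # Broadcast the context vector onto every token, then project.
    return (values * context) @ w_o


def joint_tokens(template_rgb, search_rgb, template_tir, search_tir):
    """Hypothetical fusion layout: stack all token sets into one sequence
    so one attention pass mixes intra-modal and inter-modal information."""
    return np.concatenate(
        [template_rgb, search_rgb, template_tir, search_tir], axis=0
    )


rng = np.random.default_rng(0)
d = 32  # assumed embedding width
template_rgb = rng.standard_normal((64, d))   # e.g. 8x8 template patches
search_rgb = rng.standard_normal((256, d))    # e.g. 16x16 search patches
template_tir = rng.standard_normal((64, d))   # thermal infrared template
search_tir = rng.standard_normal((256, d))    # thermal infrared search

w_i = rng.standard_normal((d, 1)) / np.sqrt(d)
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)
w_o = rng.standard_normal((d, d)) / np.sqrt(d)

tokens = joint_tokens(template_rgb, search_rgb, template_tir, search_tir)
fused = separable_attention(tokens, w_i, w_k, w_v, w_o)
```

Because the context vector is a single weighted sum over all tokens, each output token depends on both modalities and on both the template and the search region, without forming a token-by-token attention matrix. That is the property that keeps this style of attention cheap compared with standard multi-head attention.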

