CV | MM | Aug 24, 2025

MTNet: Learning modality-aware representation with transformer for RGBT tracking

arXiv:2508.17280v1 | 28 citations | h-index: 39 | ICME
Originality: Incremental advance
AI Analysis

This work addresses RGBT tracking for computer vision applications, presenting an incremental improvement over existing methods.

The paper tackles the problem of robust multi-modality representation in RGBT tracking by proposing MTNet, a modality-aware tracker based on transformer, which achieves satisfactory results on three benchmarks with real-time speed.

The ability to learn robust multi-modality representations has played a critical role in the development of RGBT tracking. However, the regular fusion paradigm and the fixed tracking template still restrict feature interaction. In this paper, we propose a transformer-based modality-aware tracker, termed MTNet. Specifically, a modality-aware network is presented to explore modality-specific cues; it contains both a channel aggregation and distribution module (CADM) and a spatial similarity perception module (SSPM). A transformer fusion network is then applied to capture global dependencies and reinforce instance representations. To estimate the precise location and handle challenges such as scale variation and deformation, we design a trident prediction head and a dynamic update strategy, which jointly maintain a reliable template that facilitates inter-frame communication. Extensive experiments validate that the proposed method achieves satisfactory results compared with state-of-the-art competitors on three RGBT benchmarks while reaching real-time speed.
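The abstract does not say how the dynamic update strategy decides when to refresh the template. As a rough illustration of the general idea only, the sketch below keeps a fixed initial template alongside one replaceable dynamic template and swaps the latter in when a prediction is confident enough. The class `TemplateBank`, the method `maybe_update`, and the `threshold` and `interval` values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TemplateBank:
    """Fixed initial template plus one dynamic template (hypothetical design)."""
    initial: List[float]
    dynamic: List[float]
    frames_since_update: int = 0

    def maybe_update(self, candidate: Sequence[float], confidence: float,
                     threshold: float = 0.7, interval: int = 10) -> bool:
        """Replace the dynamic template only when the prediction is confident
        and at least `interval` frames have passed since the last update.

        `threshold` and `interval` are assumed values, not the paper's settings.
        """
        self.frames_since_update += 1
        if confidence >= threshold and self.frames_since_update >= interval:
            self.dynamic = list(candidate)   # copy so later edits do not alias
            self.frames_since_update = 0
            return True
        return False                         # keep the current dynamic template
```

Gating the update on confidence is one common way to avoid polluting the template with occluded or drifted targets; the initial template is never overwritten, so the tracker always retains one clean reference.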

