CV | MM | Dec 27, 2024

Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues

arXiv:2412.19648v1 | 8 citations | h-index: 21 | ICASSP
Originality: Incremental advance
AI Analysis

This paper addresses modality misalignment in vision-language tracking. The advance is incremental because it builds on existing foundation models.

The paper tackles the data imbalance in Vision-Language Tracking by proposing CTVLT, a plug-and-play method that converts textual cues into visual heatmaps using foundation models, achieving state-of-the-art performance on benchmarks.

Vision-Language Tracking (VLT) aims to localize a target in video sequences using a visual template and language description. While textual cues enhance tracking potential, current datasets typically contain much more image data than text, limiting the ability of VLT methods to align the two modalities effectively. To address this imbalance, we propose a novel plug-and-play method named CTVLT that leverages the strong text-image alignment capabilities of foundation grounding models. CTVLT converts textual cues into interpretable visual heatmaps, which are easier for trackers to process. Specifically, we design a textual cue mapping module that transforms textual cues into target distribution heatmaps, visually representing the location described by the text. Additionally, the heatmap guidance module fuses these heatmaps with the search image to guide tracking more effectively. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our approach, achieving state-of-the-art performance and validating the utility of our method for enhanced VLT.
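
The abstract does not say how the heatmap is rendered or fused, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a grounding model has already turned the text into a bounding box, renders that box as a Gaussian "target distribution" heatmap, and blends the heatmap into the search image by per-pixel weighting. The names `box_to_heatmap`, `fuse`, `sigma_scale`, and `alpha` are hypothetical and chosen for illustration.

```python
import numpy as np


def box_to_heatmap(box, height, width, sigma_scale=0.25):
    """Render a 2D Gaussian centred on a box (x0, y0, x1, y1) in pixels.

    Assumption: the box comes from some text-grounding model; the
    Gaussian shape is an illustrative choice, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Spread grows with box size; the floor of 1.0 avoids division by zero.
    sx = max((x1 - x0) * sigma_scale, 1.0)
    sy = max((y1 - y0) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))


def fuse(search_image, heatmap, alpha=0.5):
    """Weight each pixel of an (H, W, C) image by its heatmap value.

    Pixels at the heatmap peak keep full intensity; pixels far from it
    are scaled down towards (1 - alpha). This blending rule is assumed.
    """
    weight = (1.0 - alpha) + alpha * heatmap
    return search_image * weight[..., None]


# Usage: a 100x120 search image and a box the text is assumed to describe.
search = np.ones((100, 120, 3))
heatmap = box_to_heatmap((20, 30, 60, 70), height=100, width=120)
guided = fuse(search, heatmap)
```

In this sketch the tracker would consume `guided` in place of the raw search image. A learned fusion module, as the abstract implies, would replace the fixed `alpha` blend.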

Code Implementations: 1 repo
