CV · Mar 20, 2023

Visual Prompt Multi-Modal Tracking

arXiv:2303.10826v2 · 335 citations · h-index: 105 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of poor transferability and data scarcity in multi-modal object tracking for computer vision applications, offering a parameter-efficient solution.

The paper tackles the problem of adapting pre-trained RGB-based foundation models to downstream multi-modal tracking tasks. It introduces Visual Prompt multi-modal Tracking (ViPT), which keeps the foundation model frozen and learns modal-relevant prompts to stimulate its knowledge, with fewer than 1% of the model parameters trainable. The paper reports state-of-the-art performance on RGB+Depth, RGB+Thermal, and RGB+Event tracking.

Visible-modal object tracking has given rise to a series of downstream multi-modal tracking branches. To inherit the powerful representations of the foundation model, a natural approach for multi-modal tracking is full fine-tuning of the RGB-based parameters. Although effective, this approach is not optimal because downstream data are scarce and transferability is poor. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multi-modal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, while introducing only a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks, including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT achieves state-of-the-art performance while remaining parameter-efficient. Code and models are available at https://github.com/jiawen-zhu/ViPT.
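
The parameter-efficiency claim in the abstract can be made concrete with a small bookkeeping sketch. It does not reproduce ViPT's architecture: the layer sizes (`EMBED_DIM`, `DEPTH`), the bottleneck-style prompt module (`BOTTLENECK`), and the helper names are all illustrative assumptions. It only shows how a frozen backbone combined with small trainable prompt modules can leave well under 1% of the parameters trainable.

```python
from dataclasses import dataclass

# All sizes below are hypothetical, chosen only for illustration.
EMBED_DIM = 768    # assumed token embedding width
DEPTH = 12         # assumed number of transformer blocks
BOTTLENECK = 8     # assumed hidden width of each prompt module


@dataclass
class ParamGroup:
    name: str
    num_params: int
    trainable: bool


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def build_groups() -> list[ParamGroup]:
    """Frozen transformer blocks, each paired with a small trainable prompt module."""
    groups = []
    for i in range(DEPTH):
        # Frozen block: attention (qkv + output projection) plus a 4x-wide MLP.
        attn = linear_params(EMBED_DIM, 3 * EMBED_DIM) + linear_params(EMBED_DIM, EMBED_DIM)
        mlp = linear_params(EMBED_DIM, 4 * EMBED_DIM) + linear_params(4 * EMBED_DIM, EMBED_DIM)
        groups.append(ParamGroup(f"block{i}", attn + mlp, trainable=False))

        # Trainable prompt module: down-projection followed by up-projection.
        prompt = linear_params(EMBED_DIM, BOTTLENECK) + linear_params(BOTTLENECK, EMBED_DIM)
        groups.append(ParamGroup(f"prompt{i}", prompt, trainable=True))
    return groups


def trainable_fraction(groups: list[ParamGroup]) -> float:
    """Share of all parameters that would receive gradient updates."""
    total = sum(g.num_params for g in groups)
    trainable = sum(g.num_params for g in groups if g.trainable)
    return trainable / total


if __name__ == "__main__":
    fraction = trainable_fraction(build_groups())
    print(f"trainable share: {fraction:.4%}")
```

With these assumed sizes the trainable share stays well below the 1% budget the abstract mentions. Widening `BOTTLENECK` shows how quickly that budget is consumed.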

Code Implementations: 1 repo
