CV · Jun 2

Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

arXiv:2606.03539 · 24.6 · h-index: 9

AI Analysis

This work targets a practical weakness of video grounding models: low-quality video inputs. It offers a robust tuning method that adapts to such inputs without disrupting pre-trained knowledge.

The paper tackles the problem of spatio-temporal video grounding under low-quality inputs, proposing Null-Space Tuning (NST) that preserves pre-trained knowledge while adapting to degraded inputs. NST achieves state-of-the-art performance on a Mixed-Quality benchmark.

Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors from the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into the input features that can be made selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals, confining components for HQ inputs to the null-space while directing restoration components for LQ inputs to the non-null space. Because the frozen weights eliminate null-space components, NST effectively rectifies degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
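The null-space property the abstract relies on can be checked numerically. The sketch below is only an illustration of that linear-algebra fact, not the paper's implementation: it assumes a single frozen linear layer computing `W @ x`, uses a random matrix as a stand-in for the frozen weights, and uses a random vector `r` in place of a learned residual. The helper name `null_space_projector` is hypothetical, and the Quality-Adaptive Unit and Dual-Space Reparameterization are not modeled here.

```python
import numpy as np

def null_space_projector(W, rtol=1e-10):
    """Return a matrix P that projects any input vector onto the null space of W."""
    # SVD: W = U @ diag(s) @ Vt. The rows of Vt past the numerical rank
    # form an orthonormal basis of the null space of W.
    _, s, Vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > rtol * s.max()))
    N = Vt[rank:].T          # shape: (in_dim, in_dim - rank)
    return N @ N.T           # orthogonal projector onto the null space

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # assumed frozen weight, maps 8-dim input to 4-dim output
x = rng.standard_normal(8)        # layer input feature
r = rng.standard_normal(8)        # stand-in for a learnable residual (assumption)

P = null_space_projector(W)
r_null = P @ r                    # component the frozen layer cannot "see"
r_range = r - r_null              # component that does change the output

out_clean = W @ x
out_null = W @ (x + r_null)       # residual confined to the null space
out_full = W @ (x + r)            # unconstrained residual

print(np.allclose(out_clean, out_null))   # null-space residual leaves the output unchanged
print(np.allclose(out_clean, out_full))   # unconstrained residual does not
```

Under these assumptions, the null-space component plays the role the abstract assigns to residuals for HQ inputs (the frozen layer's output is untouched), while the remaining component corresponds to the non-null-space part that can alter the output, as intended for LQ restoration.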
