CV | IR | MM | Sep 17, 2023

Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

arXiv:2309.09311v1 | h-index: 61
Originality: Incremental advance
AI Analysis

This addresses a specific bias issue in text-video retrieval, which is incremental as it focuses on mitigating a known temporal bias rather than introducing a new paradigm.

The paper tackles the problem of frame length bias in text-video retrieval, where models may rely on spurious correlations due to discrepancies between training and test sets, and proposes a causal debiasing approach that outperforms baseline and state-of-the-art methods on metrics like nDCG across multiple datasets.

Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from learning and inference bias, as recent research suggests for other text-video-related tasks. For instance, spatial appearance features in action recognition or temporal object co-occurrences in video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias caused by the frame length discrepancy between the training and test sets of trimmed video clips, which is, to the best of our knowledge, the first such attempt for a text-video retrieval task. We first hypothesise how the bias would affect the model and verify it with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model surpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric, which proves the bias is mitigated, as well as on the other conventional metrics.
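
For readers unfamiliar with nDCG, the following is a minimal sketch of the generic textbook form of the metric, not the evaluation code used in the paper. It assumes a linear gain, a log2 rank discount, and that the ranked list covers every candidate, so the ideal ordering can be obtained by sorting the same relevance values. The function names `dcg` and `ndcg` are illustrative, and how relevance scores are derived for each dataset is not specified here.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_relevances):
    """Normalise DCG by the DCG of the ideal (descending relevance) ordering.

    Assumption: ranked_relevances holds the relevance of every candidate,
    listed in the order the retrieval model ranked them.
    """
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    # Guard against a query with no relevant items at all.
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance scores for one text query, in model-ranked order.
score = ndcg([0.5, 1.0, 0.0])
```

Unlike recall at a fixed rank, this score rewards placing semantically similar but not identical items near the top, which is why the abstract describes it as semantic-relevancy-focused.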

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
