CV · Mar 24, 2025

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

arXiv:2503.18637v1 · 1 citation · h-index: 137 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses representation biases in video understanding benchmarks, which matters to researchers and developers who evaluate video models. It is an incremental advance because it builds on existing datasets and methods.

The paper tackles representation biases in video benchmarks, such as object bias, by proposing the Unbiased through Textual Description (UTD) benchmark. It generates frame-wise textual descriptions to analyze and debias 12 popular datasets, and it benchmarks 30 state-of-the-art models on the original and debiased splits. The authors release the structured descriptions and the object-debiased test splits.

We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
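The concept-bias check and the object-debiased split described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: that a text-only classifier has already predicted a label for each video from its object-only descriptions, and that a video is dropped from the debiased split when that prediction is correct. The function names `object_only_accuracy` and `object_debiased_split` and the toy data are hypothetical, not part of the released UTD code.

```python
from typing import Dict, List


def object_only_accuracy(labels: Dict[str, str], object_preds: Dict[str, str]) -> float:
    """Fraction of videos whose label is recovered from object descriptions alone.

    A high value suggests the benchmark is object-biased (assumed metric).
    """
    if not labels:
        return 0.0
    correct = sum(1 for vid, label in labels.items() if object_preds.get(vid) == label)
    return correct / len(labels)


def object_debiased_split(labels: Dict[str, str], object_preds: Dict[str, str]) -> List[str]:
    """Keep only videos that the object-only predictor gets wrong (assumed rule)."""
    return sorted(vid for vid, label in labels.items() if object_preds.get(vid) != label)


# Toy ground-truth labels and hypothetical object-only predictions.
labels = {
    "v1": "playing guitar",
    "v2": "opening door",
    "v3": "closing door",
    "v4": "swimming",
}
object_preds = {
    "v1": "playing guitar",  # a guitar in frame gives the label away
    "v2": "closing door",    # objects alone cannot tell opening from closing
    "v3": "closing door",
    "v4": "swimming",
}

print(object_only_accuracy(labels, object_preds))   # 0.75
print(object_debiased_split(labels, object_preds))  # ['v2']
```

In this toy case three of four videos are solvable from objects alone, so only `v2` survives in the debiased split. A real pipeline would likely aggregate several predictors before removing a sample.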

