CV AI
Aug 6, 2025

Revealing Temporal Label Noise in Multimodal Hateful Video Classification

Shuonan Yang, Tailin Chen, Rahul Singh, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

arXiv:2508.04900v1
3 citations
h-index: 2
Has Code
Proceedings of the 4th International Workshop on Multimodal Human Understanding for the Web and Social Media

Originality: Synthesis-oriented

AI Analysis

This paper addresses label ambiguity in hate speech detection, a problem relevant to both researchers and practitioners. Its contribution is incremental: it analyzes existing datasets rather than proposing a new method.

The paper investigates how coarse video-level annotations introduce label noise in multimodal hateful video detection. It trims videos to isolate their hateful segments and finds that this noise alters model decision boundaries and weakens classification confidence.

The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrate that timestamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
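The trimming step described in the abstract can be illustrated with a small sketch. This is not the authors' code: the annotation format (a list of `(start_sec, end_sec)` pairs per video) and the function names `merge_intervals`, `split_video`, and `non_hateful_fraction` are assumptions made for illustration. The sketch only computes segment boundaries; cutting the actual video files would need a separate tool.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_sec, end_sec) for each hateful span.
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def split_video(
    duration: float, hateful: List[Interval]
) -> Tuple[List[Interval], List[Interval]]:
    """Split [0, duration] into hateful and non-hateful segments."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clip annotations to the video bounds and drop empty spans.
    clipped = [(max(0.0, s), min(duration, e)) for s, e in hateful]
    clipped = [(s, e) for s, e in clipped if e > s]
    hateful_segments = merge_intervals(clipped)

    # The non-hateful segments are the gaps between hateful ones.
    non_hateful: List[Interval] = []
    cursor = 0.0
    for s, e in hateful_segments:
        if s > cursor:
            non_hateful.append((cursor, s))
        cursor = e
    if cursor < duration:
        non_hateful.append((cursor, duration))
    return hateful_segments, non_hateful


def non_hateful_fraction(duration: float, hateful: List[Interval]) -> float:
    """Share of a 'hateful' video that is actually non-hateful."""
    _, non_hateful = split_video(duration, hateful)
    return sum(e - s for s, e in non_hateful) / duration


if __name__ == "__main__":
    # Made-up example: a 120 s video with three annotated hateful spans.
    spans = [(10.0, 25.0), (20.0, 40.0), (90.0, 100.0)]
    print(split_video(120.0, spans))
    print(round(non_hateful_fraction(120.0, spans), 3))
```

In this made-up example, two thirds of a video carrying a single video-level "hateful" label consists of non-hateful content, which is the kind of label noise the abstract refers to.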

