AS CV MM · Jun 3, 2025

SNIFR: Boosting Fine-Grained Child Harmful Content Detection Through Audio-Visual Alignment with Cascaded Cross-Transformer

Orchid Chetia Phukan, Mohd Mujtaba Akhtar, Girish, Swarup Ranjan Behera, Abu Osama Siddiqui, Sarthak Jain, Priyabrata Mallick, Jaya Sai Kiran Patibandla, Pailla Balakrishna Reddy, Arun Balaji Buduru, Rajesh Sharma

arXiv:2506.03378v1 · 3.32 citations · h-index: 12 · INTERSPEECH

Originality: Incremental advance

AI Analysis

This paper addresses the need for precise detection of harmful content, such as violence or explicit scenes, in videos that children watch on video-sharing platforms. Its contribution appears incremental, centered on aligning audio with visual features.

The paper tackles fine-grained detection of harmful content in videos for child safety by combining audio cues with visual features. It introduces the SNIFR framework, which outperforms unimodal and baseline fusion methods and, according to the authors, sets a new state of the art.

As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content like violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in minimal frames to evade detection. While prior research has focused on visual cues and advanced such fine-grained detection, audio features remain underexplored. In this study, we combine audio cues with visual cues for fine-grained child harmful content detection and introduce SNIFR, a novel framework for effective alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state of the art.
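
The abstract gives no layer-level detail, so the following is only a rough sketch of the general idea: self-attention within each modality, followed by cross-attention stages chained so that the second stage consumes the output of the first. Everything here is an assumption for illustration and not the authors' implementation: the function names, the ordering of the cascade (audio attends to visual first), the mean-pooled fusion, and the omission of learned projections, layer normalization, and feed-forward sublayers.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections (a simplification)."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        output.append(mixed)
    return output


def add_residual(tokens, update):
    """Element-wise residual connection between two token sequences."""
    return [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, update)]


def mean_pool(tokens):
    """Average a token sequence into a single feature vector."""
    return [sum(tok[d] for tok in tokens) / len(tokens)
            for d in range(len(tokens[0]))]


def cascaded_cross_attention(audio, visual):
    """Hypothetical two-stage cascade over audio and visual token sequences."""
    # Intra-modality interaction: each modality attends to itself.
    audio = add_residual(audio, attend(audio, audio, audio))
    visual = add_residual(visual, attend(visual, visual, visual))
    # Stage 1 (assumed order): audio queries attend to visual keys/values.
    audio_aligned = add_residual(audio, attend(audio, visual, visual))
    # Stage 2: visual queries attend to the already-aligned audio tokens,
    # which is what makes the two stages a cascade rather than parallel branches.
    visual_aligned = add_residual(
        visual, attend(visual, audio_aligned, audio_aligned))
    # Assumed fusion: mean-pool each stream and concatenate.
    return mean_pool(audio_aligned) + mean_pool(visual_aligned)


# Toy inputs: 5 audio tokens and 3 visual tokens, each of dimension 4.
audio_tokens = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(5)]
visual_tokens = [[0.2 * i - 0.03 * j for j in range(4)] for i in range(3)]
fused = cascaded_cross_attention(audio_tokens, visual_tokens)
print(len(fused))
```

The two modalities may have different sequence lengths, since cross-attention only requires that they share a feature dimension. A real model would feed the fused vector to a classifier head over the harmful-content classes.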
