CV · Jul 12, 2025

Dynamic Inter-Class Confusion-Aware Encoder for Audio-Visual Fusion in Human Activity Recognition

arXiv:2507.09323v1 · h-index: 1
Originality: Incremental advance
AI Analysis

This work aims to improve recognition accuracy for similar, easily confused activities in audio-visual data. It is an incremental advance in human activity recognition.

The paper tackles the problem of distinguishing easily confused classes in audio-visual human activity recognition. It proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), which aligns audio-video representations at a fine-grained category level and dynamically adjusts a confusion loss according to inter-class confusion degrees. The reported top-1 accuracy on the VGGSound dataset is 65.5%, which the authors describe as near state-of-the-art.

Humans do not understand individual events in isolation; rather, they generalize concepts within classes and compare them to others. Existing audio-video pre-training paradigms focus only on the alignment of the overall audio-video modalities, without reinforcing the distinction between easily confused classes through cognitive induction and contrast during training. This paper proposes the Dynamic Inter-Class Confusion-Aware Encoder (DICCAE), an encoder that aligns audio-video representations at a fine-grained category level. DICCAE addresses category confusion by dynamically adjusting the confusion loss based on inter-class confusion degrees, thereby enhancing the model's ability to distinguish between similar activities. To further extend the application of DICCAE, we also introduce a novel training framework that incorporates both audio and video modalities, as well as their fusion. To mitigate the scarcity of audio-video data in the human activity recognition task, we propose a cluster-guided audio-video self-supervised pre-training strategy for DICCAE. DICCAE achieves near state-of-the-art performance on the VGGSound dataset, with a top-1 accuracy of 65.5%. We further evaluate its feature representation quality through extensive ablation studies, validating the necessity of each module.
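
The abstract does not give the form of the confusion loss, so the following is only an illustrative sketch of the general idea of weighting a loss by inter-class confusion, not the authors' method. It assumes a running (exponential moving average) confusion matrix and adds a penalty on the probability mass assigned to the classes a given true class is most often confused with. All names here (`ConfusionTracker`, `confusion_aware_loss`, `momentum`, `lam`) are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ConfusionTracker:
    """Hypothetical helper: running estimate of how much probability
    samples of true class i place on class j."""

    def __init__(self, num_classes, momentum=0.9):
        self.n = num_classes
        self.momentum = momentum
        self.matrix = [[0.0] * num_classes for _ in range(num_classes)]

    def update(self, true_label, probs):
        # Exponential moving average of predicted probabilities per true class.
        row = self.matrix[true_label]
        for j in range(self.n):
            row[j] = self.momentum * row[j] + (1 - self.momentum) * probs[j]

    def weights(self, true_label):
        # Off-diagonal confusion for one class, normalized to sum to 1.
        row = self.matrix[true_label]
        off = [0.0 if j == true_label else row[j] for j in range(self.n)]
        total = sum(off)
        if total == 0.0:
            # Nothing observed yet: weight all other classes uniformly.
            return [0.0 if j == true_label else 1.0 / (self.n - 1)
                    for j in range(self.n)]
        return [v / total for v in off]


def confusion_aware_loss(logits, true_label, tracker, lam=1.0):
    """Cross-entropy plus an assumed penalty that grows when probability
    lands on classes historically confused with the true class."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[true_label])
    w = tracker.weights(true_label)
    penalty = sum(w[j] * probs[j] for j in range(len(probs)))
    return cross_entropy + lam * penalty


# Usage: class 0 has mostly been confused with class 1 so far.
tracker = ConfusionTracker(num_classes=3)
tracker.update(0, [0.5, 0.4, 0.1])
loss_toward_confused = confusion_aware_loss([1.0, 2.0, 0.0], 0, tracker)
loss_toward_other = confusion_aware_loss([1.0, 0.0, 2.0], 0, tracker)
```

In this sketch the two example predictions have the same cross-entropy, but the one that leans toward the historically confused class receives the larger total loss. Because the weights are recomputed from the running matrix, the penalty shifts as the confusion pattern changes during training, which is the sense of "dynamic" assumed here.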

