CV, AI, LG, SD, AS
Sep 15, 2024

Integrating Audio Narrations to Strengthen Domain Generalization in Multimodal First-Person Action Recognition

arXiv:2409.09611v1
3 citations
h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses domain shift in wearable-camera activity recognition and offers an incremental improvement through multimodal integration.

The paper tackles domain generalization in first-person action recognition by integrating motion, audio, and appearance features, achieving state-of-the-art performance on the ARGO1M dataset.

First-person activity recognition is rapidly growing due to the widespread use of wearable cameras but faces challenges from domain shifts across different environments, such as varying objects or background scenes. We propose a multimodal framework that improves domain generalization by integrating motion, audio, and appearance features. Key contributions include analyzing the resilience of audio and motion features to domain shifts, using audio narrations for enhanced audio-text alignment, and applying consistency ratings between audio and visual narrations to optimize the impact of audio in recognition during training. Our approach achieves state-of-the-art performance on the ARGO1M dataset, effectively generalizing across unseen scenarios and locations.
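The idea of using consistency ratings to control how much audio influences training can be illustrated with a small sketch. This is not the paper's implementation: the function names, the single scalar `consistency` in [0, 1], and the choice to scale an audio cross-entropy term by that scalar are all assumptions made here for illustration only.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def consistency_weighted_loss(visual_logits, audio_logits, label, consistency):
    """Hypothetical per-sample loss: visual term plus a down-weighted audio term.

    `consistency` is an assumed rating in [0, 1] of how well the audio
    narration agrees with the visual narration. A low rating shrinks the
    audio term so that inconsistent audio contributes less to training.
    """
    if not 0.0 <= consistency <= 1.0:
        raise ValueError("consistency must lie in [0, 1]")
    visual_term = cross_entropy(visual_logits, label)
    audio_term = cross_entropy(audio_logits, label)
    return visual_term + consistency * audio_term


# Two-class toy example with uninformative logits for both branches.
loss_ignored = consistency_weighted_loss([0.0, 0.0], [0.0, 0.0], 0, 0.0)
loss_half = consistency_weighted_loss([0.0, 0.0], [0.0, 0.0], 0, 0.5)
```

With a rating of 0 the audio branch drops out of the loss entirely, and with a rating of 1 both branches count equally; the actual weighting scheme in the paper may differ.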

