CV | MM | Apr 15, 2025

Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

arXiv:2504.11232v1 | h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the need for robust, self-explainable video models in complex video analysis, representing an incremental advancement in interpretable multimodal learning.

The study asks whether concept-informed supervision on modality-specific datasets improves multimodal video interpretation models. Models trained this way outperform traditionally trained ones, and late fusion comes close to early fusion performance.

We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those using traditional legacy training in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.
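
As a rough illustration of the idea described above, the sketch below groups annotated samples into per-modality subsets and combines per-modality scores with a simple late-fusion rule. It is not the authors' implementation: the sample layout, the names `build_cmsds` and `late_fusion`, the clip identifiers, and the averaging rule are all assumptions made for this example, and the actual MOByGaze annotation format may differ.

```python
from collections import defaultdict


def build_cmsds(samples):
    """Group sample ids by the modality of their annotated concepts.

    Hypothetical layout: each sample is a dict with an "id" and a list of
    "concepts", each concept carrying a "modality" field.
    """
    subsets = defaultdict(list)
    for sample in samples:
        # A sample joins every subset for which it has at least one concept.
        modalities = sorted({c["modality"] for c in sample["concepts"]})
        for modality in modalities:
            subsets[modality].append(sample["id"])
    return dict(subsets)


def late_fusion(scores_by_modality):
    """Average per-modality scores (one simple late-fusion rule, assumed here)."""
    return sum(scores_by_modality.values()) / len(scores_by_modality)


# Toy annotations, invented for illustration only.
samples = [
    {"id": "clip_a", "concepts": [{"name": "c1", "modality": "visual"}]},
    {"id": "clip_b", "concepts": [{"name": "c2", "modality": "textual"},
                                  {"name": "c3", "modality": "visual"}]},
    {"id": "clip_c", "concepts": [{"name": "c4", "modality": "audio"}]},
]

cmsds = build_cmsds(samples)
fused = late_fusion({"visual": 0.75, "textual": 0.5, "audio": 0.25})
print(cmsds)
print(fused)
```

In this sketch each modality-specific model would be trained on its own subset and produce one score per clip. Early fusion, by contrast, would combine the modality features before a single model makes the prediction.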

