CV MM · Apr 15, 2025

Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset

Elisa Ancarani, Julie Tores, Lucile Sassatelli, Rémy Sun, Hui-Yin Wu, Frédéric Precioso

arXiv:2504.11232v1 · 3.6 · h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses the need for robust, self-explainable video models in complex video analysis, representing an incremental advancement in interpretable multimodal learning.

The study examines whether concept-informed supervision with modality-specific datasets improves multimodal video interpretation models. Models trained this way outperform those trained traditionally, and late fusion comes close to early fusion performance.

We examine the impact of concept-informed supervision on multimodal video interpretation models using MOByGaze, a dataset containing human-annotated explanatory concepts. We introduce Concept Modality Specific Datasets (CMSDs), which consist of data subsets categorized by the modality (visual, textual, or audio) of annotated concepts. Models trained on CMSDs outperform those using traditional legacy training in both early and late fusion approaches. Notably, this approach enables late fusion models to achieve performance close to that of early fusion models. These findings underscore the importance of modality-specific annotations in developing robust, self-explainable video models and contribute to advancing interpretable multimodal learning in complex video analysis.
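As a rough illustration of the CMSD idea described above, the sketch below groups clips into one subset per modality of their annotated concepts. The record layout, the field names (`clip_id`, `concepts`), the sample values, and the rule that a clip with concepts in several modalities joins every matching subset are all assumptions made for this example; the paper's actual construction on MOByGaze may differ.

```python
# Hypothetical annotation records: each clip lists (modality, concept) pairs.
# Field names and values are illustrative only, not taken from MOByGaze.
samples = [
    {"clip_id": "c1", "label": 1, "concepts": [("visual", "framing"), ("audio", "music")]},
    {"clip_id": "c2", "label": 0, "concepts": [("textual", "dialogue")]},
    {"clip_id": "c3", "label": 1, "concepts": [("visual", "posture")]},
]


def build_cmsds(samples, modalities=("visual", "textual", "audio")):
    """Group clip ids into one subset per modality of their annotated concepts.

    Assumption: a clip annotated in several modalities is placed in every
    matching subset.
    """
    subsets = {modality: [] for modality in modalities}
    for sample in samples:
        # Collect the distinct modalities in which this clip has concepts.
        present = {modality for modality, _concept in sample["concepts"]}
        for modality in modalities:
            if modality in present:
                subsets[modality].append(sample["clip_id"])
    return subsets


cmsds = build_cmsds(samples)
```

Each subset could then supervise the matching modality branch of a model, whether the branches are fused early or late.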
