Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
This work addresses fundamental research questions in multimodal AI by providing a principled method for analyzing interactions between modalities. The contribution is incremental: it builds on existing information-theoretic concepts and applies them to modern multimodal challenges.
The paper tackles the problem of quantifying interactions in multimodal tasks. It proposes an information-theoretic framework that measures redundancy, uniqueness, and synergy, introduces scalable estimators validated on synthetic and real-world benchmarks, and applies them to model selection and domain-specific case studies.
The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimodal models to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimates are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) developing principled approaches for model selection, and (4) three real-world case studies, conducted with domain experts in pathology, mood prediction, and robotic perception, where our framework helps to recommend strong multimodal models for each application.
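To make redundancy and synergy concrete, the following is a minimal sketch on toy discrete distributions. It is not one of the paper's two estimators. It assumes the standard partial information decomposition consistency identities, I(X1;Y) = R + U1, I(X2;Y) = R + U2, and I(X1,X2;Y) = R + U1 + U2 + S, from which I(X1;Y) + I(X2;Y) - I(X1,X2;Y) = R - S. Plain mutual information therefore only pins down the difference R - S, which is one reason dedicated PID estimators are needed. The function names `mutual_information` and `interaction_summary` and the two toy tasks are hypothetical, chosen for illustration.

```python
import math
from collections import defaultdict


def mutual_information(joint, a_idx, b_idx):
    """Return I(A;B) in bits.

    joint maps outcome tuples (x1, x2, y) to probabilities;
    a_idx and b_idx are tuples of positions selecting A and B.
    """
    pa, pb, pab = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        pa[a] += p
        pb[b] += p
        pab[(a, b)] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in pab.items()
        if p > 0
    )


def interaction_summary(joint):
    """Mutual-information terms for two modalities (positions 0, 1) and a label (position 2)."""
    i1 = mutual_information(joint, (0,), (2,))
    i2 = mutual_information(joint, (1,), (2,))
    i12 = mutual_information(joint, (0, 1), (2,))
    # Under the PID identities assumed above, this difference equals R - S.
    return {"I(X1;Y)": i1, "I(X2;Y)": i2, "I(X1,X2;Y)": i12, "R_minus_S": i1 + i2 - i12}


# Toy task 1: Y = X1 XOR X2 with uniform inputs (purely synergistic).
xor_task = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

# Toy task 2: X1 = X2 = Y, a shared fair bit (purely redundant).
copy_task = {(b, b, b): 0.5 for b in (0, 1)}

xor_summary = interaction_summary(xor_task)
copy_summary = interaction_summary(copy_task)
print("XOR :", xor_summary)
print("COPY:", copy_summary)
```

In the XOR task neither modality alone carries information about the label, yet together they determine it, so R - S is -1 bit. In the copy task each modality alone already carries the full bit, so R - S is +1 bit. For distributions that mix both kinds of interaction the two effects cancel in this difference, so separating R, U1, U2, and S requires a full PID definition and estimator rather than mutual information alone.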