Fusion or Confusion? Multimodal Complexity Is Not All You Need
This work examines methodological rigor in multimodal learning research. It shows that incremental architectural novelty may not yield reliable gains, a finding relevant to researchers and practitioners who need robust and trustworthy evaluations.
The authors challenge the assumption that complex multimodal architectures improve performance through a large-scale empirical study of 19 methods reimplemented under standardized conditions. They find that more complex methods do not reliably outperform a simple late-fusion Transformer baseline (SimBaMM) and often fail to outperform well-tuned unimodal baselines, especially in small-data settings.
Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature, followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
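To make the term "late fusion" concrete, the following is a minimal sketch of the general idea and how it can tolerate missing modalities. It is not the SimBaMM architecture: the fixed linear encoders and the mean-pooling fusion step are simplifying assumptions standing in for learned encoders and a Transformer, and all names (`make_linear_encoder`, `late_fusion`, the `image` and `text` modalities) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[float]], Vector]


def make_linear_encoder(weights: Sequence[Sequence[float]]) -> Encoder:
    """Return a toy encoder applying a fixed linear map (one row per output dim)."""
    def encode(x: Sequence[float]) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return encode


def late_fusion(
    inputs: Dict[str, Optional[Sequence[float]]],
    encoders: Dict[str, Encoder],
) -> Vector:
    """Encode each present modality independently, then mean-pool the embeddings.

    A modality whose input is None is treated as missing and skipped.
    Mean pooling is an assumption of this sketch, not the paper's fusion step.
    """
    embeddings = [
        encoders[name](x) for name, x in inputs.items() if x is not None
    ]
    if not embeddings:
        raise ValueError("at least one modality must be present")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


# Hypothetical modalities with different input sizes but a shared embedding size.
encoders = {
    "image": make_linear_encoder([[1.0, 0.0], [0.0, 1.0]]),
    "text": make_linear_encoder([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]),
}

# All modalities present.
full = late_fusion({"image": [1.0, 3.0], "text": [1.0, 5.0, 2.0]}, encoders)

# The text modality is missing; fusion falls back to the image embedding alone.
partial = late_fusion({"image": [1.0, 3.0], "text": None}, encoders)
```

The design point the sketch illustrates is that modalities interact only after each has been encoded separately, so dropping one changes the set of embeddings being fused rather than the shape of any encoder input.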