Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
This work addresses the issue of inflated performance claims in multimodal research, providing a critical baseline for more accurate evaluation.
The paper tackles the overestimation of multimodal model performance by showing that unimodal baselines, which better capture dataset biases, outperform published baselines by up to 29% (absolute) on visual navigation and QA datasets.
We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research. Where existing work often compares against random or majority class baselines, we argue that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimodal techniques. We present unimodal ablations on three recent datasets in visual navigation and QA, and observe absolute gains of up to 29% over published baselines.
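To illustrate the contrast between a majority class baseline and a unimodal baseline, here is a minimal sketch on a toy QA task. Everything in it is an assumption made for illustration: the data is synthetic, the names `TRAIN`, `TEST`, `majority_baseline`, and `question_only_baseline` are hypothetical, and the per-question lookup stands in for whatever unimodal model a real ablation would train. It is not the paper's method or data.

```python
from collections import Counter, defaultdict

# Synthetic (question, image_id, answer) triples, invented for illustration.
TRAIN = [
    ("what color is the banana", "img1", "yellow"),
    ("what color is the banana", "img2", "yellow"),
    ("what color is the banana", "img3", "green"),
    ("is there a dog", "img1", "yes"),
    ("is there a dog", "img2", "yes"),
    ("is there a dog", "img3", "no"),
    ("is there a dog", "img4", "yes"),
    ("how many chairs", "img1", "two"),
    ("how many chairs", "img2", "two"),
]
TEST = [
    ("what color is the banana", "img5", "yellow"),
    ("is there a dog", "img5", "yes"),
    ("how many chairs", "img5", "two"),
    ("is there a dog", "img6", "no"),
]


def majority_baseline(train):
    """Always predict the single most frequent answer in the training set."""
    top_answer = Counter(a for _, _, a in train).most_common(1)[0][0]
    return lambda question, image_id: top_answer


def question_only_baseline(train):
    """Unimodal (language-only) baseline: the image is never consulted.

    Predicts the most frequent training answer for each question string,
    falling back to the global majority answer for unseen questions.
    """
    answers_by_question = defaultdict(Counter)
    for question, _, answer in train:
        answers_by_question[question][answer] += 1
    fallback = majority_baseline(train)

    def predict(question, image_id):
        counts = answers_by_question.get(question)
        if counts is None:
            return fallback(question, image_id)
        return counts.most_common(1)[0][0]

    return predict


def accuracy(predict, data):
    """Fraction of examples whose predicted answer matches the label."""
    return sum(predict(q, img) == a for q, img, a in data) / len(data)


majority_acc = accuracy(majority_baseline(TRAIN), TEST)
question_only_acc = accuracy(question_only_baseline(TRAIN), TEST)
print(f"majority class: {majority_acc:.2f}")
print(f"question only:  {question_only_acc:.2f}")
```

On this toy data the question-only predictor scores 0.75 against 0.25 for the majority class baseline, although it never looks at an image. The size of the gap is an artifact of the synthetic examples and says nothing about the reported results; the point is only that a unimodal baseline picks up answer biases tied to the question, which a majority class baseline cannot see.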