Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
This work addresses family bias in VLM ensembles and is aimed at researchers and practitioners who combine multiple VLMs. It improves benchmark accuracy, though the contribution is incremental in that it builds on existing ensemble techniques.
The paper tackles correlated errors among vision-language models from the same architectural family. These errors reduce an ensemble's effective diversity and create a Misleading tier of questions where ensemble accuracy drops to 0% even though the best individual model is correct. Among the proposed methods, Learned Candidate Scoring achieves gains of +0.68% on VQAv2, +0.61% on TextVQA, and +2.45% on GQA, and reaches 87.83% on VQAv2 test-standard, which the authors take as confirmation of generalization.
Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors drive ensemble accuracy to 0% even though the best model is correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method that weights models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality. It achieves the largest gains (+0.68% VQAv2, +0.61% TextVQA, +2.45% GQA, all significant) and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
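To make the idea behind Hierarchical Family Voting concrete, the following is a minimal sketch of two-stage voting, not the authors' implementation. It assumes unweighted plurality votes at both stages and ties broken by first occurrence; the function names, family labels, and toy answers are hypothetical. It shows how a large family that is wrong together can outvote correct models under flat voting but not once each family is collapsed to a single vote.

```python
from collections import Counter


def majority(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def flat_vote(predictions):
    """Standard plurality vote: every model counts once, families ignored."""
    return majority([answer for _, answer in predictions])


def hierarchical_family_vote(predictions):
    """Two-stage vote (illustrative assumption: unweighted at both stages).

    Stage 1: each family picks its own plurality answer.
    Stage 2: families vote, one vote per family.
    """
    by_family = {}
    for family, answer in predictions:
        by_family.setdefault(family, []).append(answer)
    family_answers = [majority(answers) for answers in by_family.values()]
    return majority(family_answers)


# Hypothetical toy case: three models from family "A" share the same wrong
# answer, while single models from families "B" and "C" agree on another.
predictions = [
    ("A", "cat"), ("A", "cat"), ("A", "cat"),
    ("B", "dog"),
    ("C", "dog"),
]

print(flat_vote(predictions))                 # cat (3 model votes vs 2)
print(hierarchical_family_vote(predictions))  # dog (2 family votes vs 1)
```

In this toy case flat voting follows the three correlated models, whereas the family-level vote sides with the two independent families. A practical variant would likely weight votes by model or family quality, which this sketch leaves out.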