Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers
This work evaluates the bio-fidelity of AI multisensory integration, for researchers in speech perception and AI; its contribution is incremental, highlighting limitations of current models.
The study compared AV-HuBERT's responses to incongruent audiovisual stimuli (the McGurk effect) with those of human observers, finding nearly identical auditory dominance rates (32.0% vs. 31.8%) but a deterministic model bias toward phonetic fusion (68.0% vs. 47.7% in humans).
This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its responses to incongruent audiovisual stimuli (the McGurk effect) against those of human observers (N=44). Results reveal a close quantitative match on one measure: the model and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting that the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding the human rate (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. These findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
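The reported rates can be checked with simple arithmetic. The model's two rates sum to 100% (32.0% + 68.0%), whereas the human rates sum to 79.5% (31.8% + 47.7%), leaving roughly 20.5% of human responses outside both categories. The sketch below makes that residual explicit and, purely as an illustration, summarizes each response distribution with Shannon entropy. It assumes the response categories are mutually exclusive and exhaustive, which the abstract does not state; the "other" category, the helper names, and the choice of entropy as a spread measure are assumptions introduced here, not the study's own analysis.

```python
import math

# Response rates as reported in the abstract (fractions of responses).
human = {"auditory": 0.318, "fusion": 0.477}
model = {"auditory": 0.320, "fusion": 0.680}


def with_residual(rates):
    """Return a copy of `rates` with an 'other' category for the leftover mass.

    ASSUMPTION: categories are mutually exclusive and exhaustive, so whatever
    the named categories do not account for falls into 'other'.
    """
    out = dict(rates)
    leftover = 1.0 - sum(rates.values())
    out["other"] = max(0.0, round(leftover, 6))  # rounding removes float noise
    return out


def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


human_full = with_residual(human)
model_full = with_residual(model)

print("human 'other' share:", human_full["other"])
print("model 'other' share:", model_full["other"])
print("human entropy (bits):", round(entropy_bits(human_full), 3))
print("model entropy (bits):", round(entropy_bits(model_full), 3))
```

Under these assumptions the human distribution spreads over three categories and the model's over two, so the human entropy is higher. This pooled measure describes only the spread across response categories; it cannot separate within-observer from between-observer variability, which would require trial-level data not given in the abstract.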