CL, AI, Jan 22

Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers

arXiv:2601.15869v1
Originality: Incremental advance
AI Analysis

This work evaluates the bio-fidelity of AI multisensory integration, a problem relevant to researchers in speech perception and AI. Its contribution is incremental: it mainly highlights limitations of current models.

The study compared AV-HuBERT's response to incongruent audiovisual stimuli (McGurk effect) with human observers, finding nearly identical auditory dominance rates (32.0% vs. 31.8%) but a deterministic bias in the model toward phonetic fusion (68.0% vs. 47.7% in humans).

This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.
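One rough way to make the contrast between "perceptual stochasticity" and a "strictly categorical" model concrete is to compare the spread of the two aggregate response distributions. The sketch below uses Shannon entropy, which is a measure chosen here for illustration and not an analysis reported in the abstract. It assumes the response categories are mutually exclusive and exhaustive, so that whatever is not auditory or fusion falls into a single hypothetical "other" category. Only the four percentages come from the abstract; the function and variable names are illustrative.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution; zero terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Rates reported in the abstract, expressed as fractions.
human_auditory, human_fusion = 0.318, 0.477
model_auditory, model_fusion = 0.320, 0.680

# Assumption (not stated in the source): the remainder is one "other" category.
# Rounding to 3 decimals removes floating-point residue from the subtraction.
human_other = round(1.0 - human_auditory - human_fusion, 3)
model_other = round(1.0 - model_auditory - model_fusion, 3)

human_entropy = shannon_entropy([human_auditory, human_fusion, human_other])
model_entropy = shannon_entropy([model_auditory, model_fusion, model_other])

print(f"human 'other' share: {human_other:.3f}")
print(f"human entropy: {human_entropy:.3f} bits")
print(f"model entropy: {model_entropy:.3f} bits")
```

Under these assumptions the human distribution has the higher entropy, because roughly a fifth of human responses fall outside the two categories that account for all of the model's output. This is only a summary of aggregate rates; it says nothing about trial-to-trial variability within a single observer or within the model.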

