CL Sep 19, 2025

The Curious Case of Visual Grounding: Different Effects for Speech- and Text-based Language Encoders

Adrian Sauter, Willem Zuidema, Marianne de Heer Kloots

arXiv:2509.15837v14.92 citations h-index: 9

Originality: Incremental advance

AI Analysis

This research addresses the problem of enriching speech-based models with semantic understanding so that they can be developed more efficiently. The advance is incremental, since it builds on existing visual grounding methods.

The study investigated how visual grounding during training affects language processing in speech- and text-based deep learning models, finding that it increases alignment between spoken and written language representations but does not improve semantic discriminability in speech-based models.

How does visual information included in training affect language processing in audio- and text-based deep learning models? We explore how such visual grounding affects model-internal representations of words, and find substantially different effects in speech- vs. text-based language encoders. Firstly, global representational comparisons reveal that visual grounding increases alignment between representations of spoken and written language, but this effect seems mainly driven by enhanced encoding of word identity rather than meaning. We then apply targeted clustering analyses to probe for phonetic vs. semantic discriminability in model representations. Speech-based representations remain phonetically dominated with visual grounding, but in contrast to text-based representations, visual grounding does not improve semantic discriminability. Our findings could usefully inform the development of more efficient methods to enrich speech-based models with visually-informed semantics.
