CV · Jul 29, 2025

VeS: Teaching Pixels to Listen Without Supervision

arXiv:2507.22008v1
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making audio-visual models effective in noisy, code-switched multilingual environments typical of developing regions, representing an incremental improvement in method adaptation.

The paper examines dense audio-visual models in low-resource, multilingual settings. It shows that a dense contrastive objective improves audio-visual retrieval by +59% relative R@1 over global pooling and enables sharp zero-shot localization with a frozen vision backbone.

Recent dense audio-visual (AV) models achieve impressive retrieval and emergent localization, but almost all evidence comes from English-centric, caption-rich web video. It is unclear whether these objectives survive in low-resource, code-switched, and noisy multilingual settings that typify developing regions. We show they do, and that the choice of aggregation function becomes even more critical. Using a multilingual subset of Project Vaani spanning dozens of Indian languages and dialectal variants, we compare three contrastive objectives: (i) a global mean-pooled loss (CLIP-style), (ii) a dense max-mean token matcher (DenseAV-style), and (iii) a simple hybrid (motivated by frozen-vision alignment strategies). The dense objective delivers a +59% relative R@1 (Audio Visual) improvement over global pooling and substantially lower mean/median ranks, while consistently producing sharp zero-shot localization heatmaps of spoken objects, despite keeping the vision backbone entirely frozen (no LoRA / partial fine-tuning). Our results demonstrate that dense token routing is not a luxury of high-resource English corpora; it is more decisive when annotations and acoustic cleanliness are scarce. We release the codebase and trained models.
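
To make the difference between objectives (i) and (ii) concrete, here is a minimal NumPy sketch. It is not the released code: the function names, tensor shapes, the temperature of 0.07, and the choice to take the max over visual tokens and the mean over audio tokens are all assumptions made for illustration. The hybrid objective (iii) is not specified in the abstract, so it is left out.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_similarity(audio, visual):
    """Objective (i), sketched: mean-pool tokens, then compare clip-level vectors.

    audio:  (B, Ta, D) audio token features (hypothetical shapes)
    visual: (B, Tv, D) visual patch features
    returns a (B, B) matrix of audio-to-visual cosine similarities.
    """
    a = l2_normalize(audio.mean(axis=1))
    v = l2_normalize(visual.mean(axis=1))
    return a @ v.T

def dense_max_mean_similarity(audio, visual):
    """Objective (ii), sketched: match tokens first, aggregate afterwards.

    For every audio/visual pair, each audio token keeps its best-matching
    visual token (max), and those scores are averaged over audio tokens (mean).
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    # token_sim[i, j, t, p]: audio token t of clip i vs. visual token p of clip j
    token_sim = np.einsum("itd,jpd->ijtp", a, v)
    return token_sim.max(axis=-1).mean(axis=-1)

def symmetric_info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a (B, B) similarity matrix.

    Matching pairs are assumed to sit on the diagonal.
    """
    logits = sim / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def recall_at_1(sim):
    """Fraction of audio queries whose top-ranked visual item is the true match."""
    return float(np.mean(sim.argmax(axis=1) == np.arange(sim.shape[0])))
```

Both similarity functions return a (B, B) matrix, so the same loss and the same R@1 metric apply to either; only the aggregation step differs. The pre-aggregation `token_sim` tensor is also what a per-patch localization heatmap would be read from in a setup like this one.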

