CV · Oct 2, 2025

Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions

arXiv:2510.02313v1
h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses object-aware sound recognition, with applications in robotics and human-computer interaction. The advance is incremental: it builds on existing multimodal learning techniques.

The paper links sounds to the objects involved in real-world interactions by introducing a sounding object detection task, and reports state-of-the-art performance on this new task and on existing multimodal action understanding tasks.

Can a model distinguish between the sound of a spoon hitting a hardwood floor versus a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved; these masks guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task and on existing multimodal action understanding tasks.

