How Much Does Audio Matter to Recognize Egocentric Object Interactions?
This work addresses the under-explored use of audio for egocentric action recognition, which could benefit applications such as assistive technologies and human-computer interaction. The contribution is incremental, however, as it builds on existing benchmarks.
The paper tackles the problem of recognizing egocentric object interactions by proposing an audio-only model. Using a comparatively lighter architecture, the model achieves a verb classification accuracy of 34.26% on a standard benchmark, which is competitive with vision-based systems.
Sounds are an important source of information about our daily interactions with objects. For instance, a significant number of people can discern the temperature of water that is being poured just by listening. However, only a few works have explored the use of audio for the classification of object interactions, either in conjunction with vision or as a single modality. In this preliminary work, we propose an audio model for egocentric action recognition and explore its usefulness on each part of the problem (noun, verb, and action classification). Our model achieves a competitive result in verb classification (34.26% accuracy) on a standard benchmark with respect to vision-based state-of-the-art systems, while using a comparatively lighter architecture.