Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
This work addresses audiovisual object extraction in large video collections using weak supervision. The contribution is incremental in that it builds upon a previous framework.
The paper tackles audiovisual scene analysis for weakly-labeled video data by integrating audio source enhancement into an existing framework. It performs object classification in noisy acoustic environments and shows encouraging visual object localization results on a music instrument performance dataset.
We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments on a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.
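
To make the two ingredients named above more concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes a nonnegative magnitude spectrogram V (frequency x time), factors it with standard multiplicative-update NMF (Euclidean cost), and applies max pooling as one common way to aggregate instance scores in multiple instance learning. The function names nmf, mil_max_pool and reconstruct_subset, the choice of cost function, and the toy data are all illustrative assumptions.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (freq x time) as W @ H.

    Uses the standard multiplicative updates for the Euclidean cost.
    (Assumption: the paper may use a different cost or solver.)
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral templates
    H = rng.random((rank, n_frames)) + eps  # temporal activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def mil_max_pool(instance_scores):
    """Bag-level score per class: maximum over instances (rows).

    Hypothetical aggregation step; each row is one instance
    (for example, one NMF component), each column one class.
    """
    return instance_scores.max(axis=0)


def reconstruct_subset(W, H, keep):
    """Rebuild a spectrogram from the selected components only."""
    keep = np.asarray(keep)
    return W[:, keep] @ H[keep, :]


# Toy nonnegative "spectrogram" with exact rank 4 (synthetic data).
rng = np.random.default_rng(1)
V = rng.random((64, 4)) @ rng.random((4, 100))

W, H = nmf(V, rank=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Error after a single iteration, same initialization, for comparison.
W1, H1 = nmf(V, rank=4, n_iter=1)
rel_err_1 = np.linalg.norm(V - W1 @ H1) / np.linalg.norm(V)

# Made-up per-component class scores: 2 instances, 2 classes.
scores = np.array([[0.1, 0.9],
                   [0.7, 0.2]])
bag_scores = mil_max_pool(scores)

# Keep only components 0 and 2 as a stand-in for "relevant" sources.
V_enhanced = reconstruct_subset(W, H, [0, 2])
```

In this sketch each NMF component plays the role of one instance in a bag, so a weak video-level label only needs to be explained by the highest-scoring component. Reconstructing from the selected components is one simple way such a decomposition could support source enhancement.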