Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing
This work addresses label noise in weakly-supervised audio-visual video parsing; its contribution is an incremental improvement over existing methods.
The paper tackles the problem of audio-visual video parsing (AVVP) by addressing label noise in weakly-supervised settings, proposing a reinforcement learning-based label denoising approach (RLLD) that jointly optimizes denoising and parsing. The method achieves superior performance compared to existing label denoising techniques and enhances other AVVP models when incorporated.
Audio-visual video parsing (AVVP) aims to recognize audio and visual event labels with precise temporal boundaries. This is challenging because only video-level labels are available for training, while the audio or visual modality on its own may contain just one of those events. Existing label denoising models often treat denoising as a separate preprocessing step, which disconnects label denoising from the AVVP task. To bridge this gap, we present a novel joint reinforcement learning-based label denoising approach (RLLD). This approach trains the label denoising and video parsing models simultaneously through a joint optimization strategy. We introduce a novel AVVP-validation and soft inter-reward feedback mechanism that directly guides the learning of the label denoising policy. Extensive experiments on AVVP tasks demonstrate the superior performance of our proposed method compared to existing label denoising techniques. Furthermore, incorporating our label denoising method into other AVVP models further improves their parsing results.
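The idea of a denoising policy guided by reward feedback can be illustrated with a minimal REINFORCE-style sketch. This is not the RLLD implementation: the abstract does not specify the policy architecture, the reward definition, or the update rule, so everything below is an assumption made for illustration. In particular, `toy_reward` is a hypothetical stand-in for a validation-based reward from the parsing model, `clean_mask` is a made-up ground truth used only to drive the toy, and the policy is reduced to one independent keep/drop logit per modality-level label.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_mask(logits, rng):
    """Sample a keep (1) / drop (0) decision for each modality-level label."""
    probs = [sigmoid(z) for z in logits]
    actions = [1 if rng.random() < p else 0 for p in probs]
    return actions, probs


def toy_reward(actions, clean_mask):
    """Hypothetical reward: fraction of decisions agreeing with clean_mask.

    In a real system this would be replaced by a score computed from the
    parsing model, for example on held-out data (an assumption here).
    """
    agree = sum(int(a == c) for a, c in zip(actions, clean_mask))
    return agree / len(clean_mask)


def reinforce_step(logits, actions, probs, reward, baseline, lr=0.5):
    """One REINFORCE update for independent Bernoulli decisions.

    The gradient of log Bernoulli(a; sigmoid(z)) with respect to z is (a - p).
    """
    advantage = reward - baseline
    return [z + lr * advantage * (a - p)
            for z, a, p in zip(logits, actions, probs)]


def train(clean_mask, steps=2000, seed=0):
    """Train the toy policy and return the final keep probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(clean_mask)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        actions, probs = sample_mask(logits, rng)
        reward = toy_reward(actions, clean_mask)
        logits = reinforce_step(logits, actions, probs, reward, baseline)
        baseline = 0.9 * baseline + 0.1 * reward
    return [sigmoid(z) for z in logits]


if __name__ == "__main__":
    # Four video-level labels copied to one modality; two are assumed noisy.
    keep_probs = train(clean_mask=[1, 0, 1, 0])
    print([round(p, 2) for p in keep_probs])
```

In this toy setting the keep probabilities should drift toward the labels that raise the reward. A joint scheme of the kind the abstract describes would additionally update the parsing model on the denoised labels within the same loop, which the sketch omits.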