Deep Multiple Instance Feature Learning via Variational Autoencoder
This work addresses MIL problems for applications like audio event detection, but it is incremental as it builds on existing VAE and MIL methods.
The paper tackles the challenge of uncertain positive instance labels in multiple instance learning (MIL) by proposing a weakly supervised framework that combines discriminative and generative models, yielding better performance on standard benchmarks and scaling to large-dataset tasks such as audio event detection.
We describe a novel weakly supervised deep learning framework that combines discriminative and generative models to learn meaningful representations in the multiple instance learning (MIL) setting. MIL is a weakly supervised learning problem in which labels are associated with groups of instances (referred to as bags) rather than with individual instances. To address the essential challenge in MIL problems, which arises from the uncertainty of positive instance labels, we use a discriminative model regularized by variational autoencoders (VAEs) to maximize the differences between the latent representations of all instances and those of negative instances. As a result, the hidden layer of the variational autoencoder learns a meaningful representation. This representation can be used effectively for MIL problems, as illustrated by better performance on the standard benchmark datasets compared with state-of-the-art approaches. More importantly, unlike most related studies, the proposed framework scales easily to problems with large datasets, as illustrated by the audio event detection and segmentation task. Visualization also confirms the effectiveness of the latent representation in discriminating between positive and negative classes.
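To make the MIL setting concrete, the following is a minimal Python sketch of the standard MIL assumption (a bag is positive if at least one of its instances is positive) and of max-pooling instance scores into a bag score. The function names `bag_label` and `bag_score` and the toy values are hypothetical illustrations, not part of the proposed framework, and the VAE-regularized model itself is not shown.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Standard MIL assumption: a bag is positive (1) if any instance is positive.

    In practice the instance labels are unobserved; only the bag label is given.
    """
    return int(any(label == 1 for label in instance_labels))


def bag_score(instance_scores: Sequence[float]) -> float:
    """Max-pooling: the bag score is the highest instance score.

    Assumes a non-empty bag (max() raises ValueError on an empty sequence).
    """
    return max(instance_scores)


# Toy bags (hypothetical data): instance labels are hidden in a real MIL task.
positive_bag = [0, 0, 1]
negative_bag = [0, 0, 0]

print(bag_label(positive_bag))  # bag-level label of the first toy bag
print(bag_label(negative_bag))  # bag-level label of the second toy bag
print(bag_score([0.1, 0.7, 0.3]))  # bag score from hypothetical instance scores
```

Under this assumption every instance in a negative bag is known to be negative, while the instances in a positive bag remain ambiguous, which is the label uncertainty the abstract refers to.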