Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets
This work addresses class imbalance in training datasets, a practical concern for machine learning practitioners. It offers a novel approach to improving model generalization, though it builds incrementally on prior semi-supervised techniques.
The paper tackles the poor performance of deep neural networks on long-tailed datasets by proposing Open-sampling, a method that uses out-of-distribution data to re-balance class priors. The authors report that it significantly outperforms existing data re-balancing methods and can improve existing state-of-the-art methods.
Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.
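The label-sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract states only that labels for open-set instances are drawn from a distribution complementary to the class priors. The specific form used here, (alpha - prior_j) / (K * alpha - 1) with alpha defaulting to the largest class prior, is an assumption, as are the function names `complementary_distribution` and `sample_open_set_labels`.

```python
import numpy as np


def complementary_distribution(class_counts, alpha=None):
    """Return a label distribution that favors rare classes.

    Assumed form (not taken from the abstract):
        p_j = (alpha - prior_j) / (K * alpha - 1),  with alpha >= max(prior).
    """
    counts = np.asarray(class_counts, dtype=float)
    priors = counts / counts.sum()
    num_classes = len(priors)
    if alpha is None:
        alpha = priors.max()
    if alpha < priors.max():
        raise ValueError("alpha must be at least the largest class prior")
    denom = num_classes * alpha - 1.0
    if denom <= 1e-12:
        # Already balanced with the smallest valid alpha: fall back to uniform.
        return np.full(num_classes, 1.0 / num_classes)
    return (alpha - priors) / denom


def sample_open_set_labels(num_open_set, class_counts, seed=0):
    """Draw one noisy label per open-set instance from the complementary distribution."""
    rng = np.random.default_rng(seed)
    probs = complementary_distribution(class_counts)
    return rng.choice(len(probs), size=num_open_set, p=probs)


# Hypothetical long-tailed dataset: one head class and two tail classes.
class_counts = [100, 10, 1]
probs = complementary_distribution(class_counts)
labels = sample_open_set_labels(1000, class_counts)
```

With alpha set to the largest prior, the most frequent class receives probability zero, so every open-set instance is assigned to a less frequent class. A larger alpha would flatten the distribution toward uniform.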