LG · ML · Jul 16, 2021

Active learning for imbalanced data under cold start

arXiv:2107.07724v2 · 11 citations
Originality: Incremental advance
AI Analysis

This paper addresses a practical issue for ML systems that face imbalanced data in cold-start scenarios, offering an incremental improvement over existing active learning methods.

The paper tackles the cold-start problem in imbalanced streaming data by proposing an active learning system with an outlier-based warm-up, achieving up to 80% gains over random sampling and competitive performance with only 2-10% of labels.

Modern systems that rely on Machine Learning (ML) for predictive modelling may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios, where labels of the positive class take longer to accumulate. We propose an Active Learning (AL) system for datasets with orders of magnitude of class imbalance, in a cold-start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where ODAL is used as warm-up. Then, we perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies without ODAL warm-up. Its observed gains over random sampling can reach 80%, and, using just 2% to 10% of the labels, it can be competitive with policies that have an unlimited annotation budget or additional historical data.
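To make the idea of a staged labeling policy with an outlier-based warm-up concrete, here is a minimal Python sketch. It is not the paper's ODAL implementation, and the abstract does not specify the three stages: the distance-from-mean outlier score, the `choose_policy` switching rule, its `warmup_budget` and `min_positives` thresholds, and the stage names are all assumptions chosen for illustration.

```python
import math
import random


def outlier_scores(points):
    """Distance of each point from the mean of `points` (a stand-in outlier score)."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return [math.dist(p, mean) for p in points]


def uncertainty_scores(probs):
    """1.0 when the predicted positive probability is 0.5, 0.0 when it is 0 or 1."""
    return [1.0 - 2.0 * abs(p - 0.5) for p in probs]


def choose_policy(n_labeled, n_positive, warmup_budget=10, min_positives=2):
    """Hypothetical switching rule between three labeling policies."""
    if n_labeled < warmup_budget:
        return "outlier"      # assumed stage 1: no usable labels yet
    if n_positive < min_positives:
        return "random"       # assumed stage 2: too few positives to train on
    return "uncertainty"      # assumed stage 3: a supervised model guides sampling


def select_batch(policy, unlabeled_idx, k, pool=None, probs=None, rng=random):
    """Pick up to k indices from unlabeled_idx according to the chosen policy."""
    if policy == "random":
        return rng.sample(unlabeled_idx, min(k, len(unlabeled_idx)))
    if policy == "outlier":
        scores = outlier_scores([pool[i] for i in unlabeled_idx])
    else:
        scores = uncertainty_scores([probs[i] for i in unlabeled_idx])
    ranked = sorted(zip(scores, unlabeled_idx), reverse=True)
    return [i for _, i in ranked[:k]]


# Usage: a synthetic, heavily imbalanced pool with two far-away "positives".
rng = random.Random(0)
pool = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
pool += [(8.0, 8.0), (9.0, 7.5)]  # indices 200 and 201
policy = choose_policy(n_labeled=0, n_positive=0)
batch = select_batch(policy, list(range(len(pool))), k=5, pool=pool)
```

In this toy pool the two rare points sit far from the bulk of the data, so an outlier-ranked warm-up surfaces them in the first batch, whereas random sampling would pick each with probability of only about 2.5% per batch of five. Real data will rarely be this clean, and the actual ODAL scoring should be taken from the paper itself.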

