Pool-Based Sequential Active Learning for Regression
This work addresses efficient data labeling for regression tasks and offers an incremental improvement over existing active learning methods.
The paper aims to reduce data labeling effort in regression by proposing a new pool-based sequential active learning approach that accounts for both representativeness and diversity. The approach can also be integrated with existing methods, and experiments on 11 datasets from various domains show its effectiveness.
Active learning is a machine learning approach for reducing data labeling effort. Given a pool of unlabeled samples, it tries to select the most useful ones to label, so that a model built from them can achieve the best possible performance. This paper focuses on pool-based sequential active learning for regression (ALR). We first propose three essential criteria that an ALR approach should consider when selecting the most useful unlabeled samples: informativeness, representativeness, and diversity. We then compare four existing ALR approaches against these criteria. Next, we propose a new ALR approach using passive sampling, which considers both representativeness and diversity in the initialization as well as in subsequent iterations. This approach can also be integrated with other existing ALR approaches in the literature to further improve their performance. Extensive experiments on 11 UCI, CMU StatLib, and UFL Media Core datasets from various domains verified the effectiveness of our proposed ALR approaches.
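To make the representativeness and diversity criteria concrete, the following is a minimal sketch of a pool-based sequential selector that uses only the input features and no labels. It is an illustration under stated assumptions, not the procedure proposed in the paper: the scoring rule (distance to the nearest already-selected sample minus mean distance to the pool) and the names `select_next` and `run_selection` are hypothetical choices made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_next(pool, selected):
    """Return the index of the next pool sample to label, or None if none remain.

    Assumed scoring rule (illustrative only):
      representativeness = low mean distance to the rest of the pool
      diversity          = high distance to the nearest already-selected sample
    """
    n = len(pool)
    best_idx, best_score = None, -math.inf
    for i in range(n):
        if i in selected:
            continue
        # Mean distance to all other pool samples; smaller means more central.
        mean_dist = sum(
            euclidean(pool[i], pool[j]) for j in range(n) if j != i
        ) / max(n - 1, 1)
        # Distance to the nearest selected sample; 0.0 before anything is selected,
        # so the first pick depends on representativeness alone.
        min_dist = min(
            (euclidean(pool[i], pool[j]) for j in selected), default=0.0
        )
        score = min_dist - mean_dist
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


def run_selection(pool, budget):
    """Sequentially pick up to `budget` pool indices, one per iteration."""
    selected = []
    for _ in range(min(budget, len(pool))):
        selected.append(select_next(pool, selected))
    return selected


if __name__ == "__main__":
    pool = [(0.0,), (1.0,), (2.0,), (10.0,)]
    print(run_selection(pool, 3))
```

In a full active learning loop, each selected index would be sent to an oracle for labeling and the regression model would be retrained before the next query. An informativeness term based on the current model could be added to the score at that point.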