Mitigating shortage of labeled data using clustering-based active learning with diversity exploration
This work addresses the problem of data labeling inefficiency for machine learning practitioners, but it appears incremental as it builds on existing active learning methods.
The paper tackles the shortage of labeled data by proposing a clustering-based active learning framework called ALCS. ALCS uses density-based clustering and a bi-cluster boundary-based query procedure to improve classification of highly overlapped classes, and the authors report experimental results in support of its efficacy.
In this paper, we propose a new clustering-based active learning framework, namely Active Learning using a Clustering-based Sampling (ALCS), to address the shortage of labeled data. ALCS employs a density-based clustering approach to explore the cluster structure of the data without requiring exhaustive parameter tuning. We introduce a bi-cluster boundary-based sample query procedure to improve learning performance when classifying highly overlapped classes. Additionally, we develop a diversity exploration strategy to reduce redundancy among the queried samples. Our experimental results demonstrate the efficacy of the ALCS approach.
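The three ingredients named in the abstract (density-based clustering, boundary-focused querying, and diversity among queried samples) can be illustrated with a minimal sketch. This is not the ALCS implementation, whose details are not given here. It assumes plain DBSCAN as the density-based clusterer, a centroid-margin heuristic as a stand-in for the bi-cluster boundary procedure, and a greedy minimum-distance filter as the diversity strategy. All function names and parameters (`dbscan`, `boundary_margin`, `select_queries`, `eps`, `min_pts`, `min_gap`) are hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN (assumed stand-in clusterer); label -1 means noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels


def centroids(points, labels):
    """Mean position of each non-noise cluster."""
    groups = {}
    for p, label in zip(points, labels):
        if label != -1:
            groups.setdefault(label, []).append(p)
    return {k: tuple(sum(c) / len(ps) for c in zip(*ps)) for k, ps in groups.items()}


def boundary_margin(p, cents):
    """Gap between distances to the two nearest centroids; small = near a boundary."""
    d = sorted(math.dist(p, c) for c in cents.values())
    return d[1] - d[0]


def select_queries(points, labels, budget, min_gap):
    """Pick up to `budget` boundary samples that are at least `min_gap` apart."""
    cents = centroids(points, labels)
    if len(cents) < 2:
        raise ValueError("need at least two clusters to define a boundary")
    ranked = sorted(range(len(points)), key=lambda i: boundary_margin(points[i], cents))
    chosen = []
    for i in ranked:
        # Diversity filter: skip candidates too close to an already chosen sample.
        if all(math.dist(points[i], points[j]) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == budget:
            break
    return chosen


def grid(x0):
    """Toy data: a 5 x 3 grid of points with 0.5 spacing starting at x = x0."""
    return [(x0 + 0.5 * i, 0.5 * j) for i in range(5) for j in range(3)]


points = grid(0.0) + grid(4.0)  # two well-separated toy clusters
labels = dbscan(points, eps=0.6, min_pts=3)
chosen = select_queries(points, labels, budget=4, min_gap=0.75)
print([points[i] for i in chosen])  # samples to send to the annotator
```

In an active learning loop, the samples returned by `select_queries` would be labeled by an annotator and added to the training set before the classifier is retrained. The margin heuristic and the fixed `min_gap` are deliberately simple choices made for illustration.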