A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
This paper addresses the problem of biased sampling in meta-research, which affects researchers analyzing multi-label datasets with imbalanced and dependent categories. The contribution is incremental, as it builds on existing multivariate Bernoulli methods.
The paper tackled the challenge of sampling from multi-label data with imbalanced and dependent labels by proposing a novel multivariate Bernoulli-based sampling algorithm that accounts for label dependencies. The result was a more balanced sub-sample that enhanced representation of minority categories, as demonstrated on a dataset of research articles with 64 biomedical topics.
Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that includes enough observations with the scarcer labels to support inferences about those labels, and that deviates from the population frequencies in a known manner. In this paper, we take a multivariate Bernoulli distribution as the underlying distribution of the multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the parameters of the multivariate Bernoulli distribution and to calculate a weight for each label combination. This approach ensures that weighted sampling attains the characteristics of the target distribution while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the frequency order of the categories, reduce the frequency differences between the most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
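
The general idea of weighting each label combination can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state how the target distribution is built, so the sketch assumes a tempered target, q(c) proportional to p(c) ** alpha, where p(c) is the observed frequency of label combination c. The function names `combination_weights` and `weighted_subsample` and the parameter `alpha` are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def combination_weights(label_sets, alpha=0.5):
    """Return a sampling weight for each observed label combination.

    Assumption (not from the paper): the target distribution is a tempered
    version of the empirical one, q(c) proportional to p(c) ** alpha.
    With 0 < alpha < 1, rare combinations are up-weighted.
    """
    counts = Counter(frozenset(s) for s in label_sets)
    n = len(label_sets)
    # Empirical joint probability of each full label combination, so
    # dependencies between labels are carried by the combinations themselves.
    p = {c: k / n for c, k in counts.items()}
    # Hypothetical target distribution over the same combinations.
    z = sum(v ** alpha for v in p.values())
    q = {c: (v ** alpha) / z for c, v in p.items()}
    # Importance-style weight: target probability over observed probability.
    return {c: q[c] / p[c] for c in p}


def weighted_subsample(label_sets, size, alpha=0.5, seed=0):
    """Draw indices with probability proportional to the combination weight.

    Uses sampling with replacement for simplicity; a real study would more
    likely sample without replacement.
    """
    weights = combination_weights(label_sets, alpha)
    per_item = [weights[frozenset(s)] for s in label_sets]
    rng = random.Random(seed)
    return rng.choices(range(len(label_sets)), weights=per_item, k=size)


# Toy data: label "A" is common, the combination {"A", "B"} is rare.
data = [{"A"}] * 90 + [{"A", "B"}] * 10
w = combination_weights(data)
sample_idx = weighted_subsample(data, size=20)
```

Because the weights are known for every combination, the deviation of the sub-sample from the population frequencies is known as well, which is the property the abstract asks for. Setting `alpha=1` in this sketch gives every combination a weight of 1 and so reproduces the population frequencies in expectation.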