ML LG AP Aug 14, 2024

Ranking and Combining Latent Structured Predictive Scores without Labeled Data

Shiva Afshar, Yinghan Chen, Shizhong Han, Ying Lin

arXiv:2408.07796v13.11 citations h-index: 37

Originality: Highly original

AI Analysis

This paper addresses the challenge of ensemble learning when labeled data is scarce and the predictors are correlated, offering a solution for domains such as bioinformatics.

The paper tackles the problem of combining multiple predictors from distributed data sources without labeled data. It introduces a structured unsupervised ensemble learning model (SUEL) that ranks and weights predictors based on their dependencies, and it evaluates the model in simulation studies and in a real-world risk gene discovery application.

Combining multiple predictors obtained from distributed data sources into an accurate meta-learner is a promising way to improve performance in many prediction problems. As the accuracy of each predictor is usually unknown, integrating the predictors to achieve better performance is challenging. Conventional ensemble learning methods assess the accuracy of predictors based on extensive labeled data. In practical applications, however, acquiring such labeled data can be arduous. Furthermore, the predictors under consideration may be highly correlated, particularly when similar data sources or machine learning algorithms were used to train them. In response to these challenges, this paper introduces a novel structured unsupervised ensemble learning model (SUEL) that exploits the dependency between a set of predictors with continuous predictive scores, ranks the predictors without labeled data, and combines them into a weighted ensemble score. Two novel correlation-based decomposition algorithms are further proposed to estimate the SUEL model: a constrained quadratic optimization approach (SUEL.CQO) and a matrix-factorization-based approach (SUEL.MF). The efficacy of the proposed methods is assessed through both simulation studies and a real-world application to risk gene discovery. The results demonstrate that the proposed methods can efficiently integrate dependent predictors into an ensemble model without the need for ground truth data.
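To make the idea of ranking and weighting predictors from their correlations concrete, here is a minimal Python sketch. It is not SUEL.CQO or SUEL.MF, whose details are not given on this page. It swaps in a much simpler stand-in: a one-factor fit to the off-diagonal entries of the correlation matrix (iterated principal-axis factoring). It also assumes the predictors are independent given the latent target, which is exactly the assumption the paper's structured model is meant to relax. The function name `rank_and_combine`, the simulated data, and the choice of weights proportional to the estimated loadings are all illustrative assumptions.

```python
import numpy as np

def rank_and_combine(scores, n_iter=200):
    """Rank and combine continuous predictor scores without labels.

    scores: array of shape (n_samples, n_predictors).
    Assumption (not from the paper): one latent factor drives all predictors
    and their noise terms are independent.
    """
    # Standardize each predictor so that scale does not influence the weights.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)
    off_diag = corr - np.diag(np.diag(corr))

    # Start the diagonal at the largest absolute off-diagonal correlation per row.
    h = np.abs(off_diag).max(axis=1)
    for _ in range(n_iter):
        # Fit a rank-one matrix to the correlation matrix with diagonal h,
        # then update h to the squared loadings of that fit.
        eigvals, eigvecs = np.linalg.eigh(off_diag + np.diag(h))
        loading = np.sqrt(max(eigvals[-1], 0.0)) * eigvecs[:, -1]
        h = np.clip(loading**2, 0.0, 1.0)

    # Eigenvectors have arbitrary sign; assume most predictors point the right way.
    if loading.sum() < 0:
        loading = -loading

    ranking = np.argsort(-loading)              # best predictor first
    weights = loading / np.abs(loading).sum()   # simple choice, not claimed optimal
    ensemble = z @ weights
    return ranking, loading, weights, ensemble

# Simulated example: four predictors with decreasing correlation to a hidden target.
rng = np.random.default_rng(0)
n = 20000
latent = rng.normal(size=n)                     # hidden ground truth, never shown to the method
signal = np.array([0.8, 0.6, 0.4, 0.2])         # true correlation of each predictor with latent
noise = rng.normal(size=(n, 4))
scores = latent[:, None] * signal + noise * np.sqrt(1 - signal**2)

ranking, loading, weights, ensemble = rank_and_combine(scores)
print("ranking:", ranking)
print("estimated loadings:", np.round(loading, 3))
print("ensemble correlation with latent:", np.corrcoef(ensemble, latent)[0, 1])
```

In this simulation the estimated loadings approximate each predictor's correlation with the hidden target, so sorting them gives a label-free ranking. When predictors share noise, for example because they were trained on similar data, this one-factor sketch can overrate the correlated group, which is the situation the abstract says SUEL is designed to handle.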

