Robust low-rank estimation with multiple binary responses using pairwise AUC loss

arXiv:2601.08618v1 | h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses the statistical inefficiency of fitting separate models to multiple binary responses in high-dimensional, class-imbalanced settings. The advance is incremental: it combines existing low-rank models with a pairwise AUC surrogate loss.

The authors model multiple binary responses with a low-rank framework that directly optimizes ranking performance through a pairwise AUC surrogate loss. In simulations, the method is robust to label switching and data contamination and outperforms likelihood-based methods.

Multiple binary responses arise in many modern data-analytic problems. Although fitting separate logistic regressions for each response is computationally attractive, it ignores shared structure and can be statistically inefficient, especially in high-dimensional and class-imbalanced regimes. Low-rank models offer a natural way to encode latent dependence across tasks, but existing methods for binary data are largely likelihood-based and focus on pointwise classification rather than ranking performance. In this work, we propose a unified framework for learning with multiple binary responses that directly targets discrimination by minimizing a surrogate loss for the area under the ROC curve (AUC). The method aggregates pairwise AUC surrogate losses across responses while imposing a low-rank constraint on the coefficient matrix to exploit shared structure. We develop a scalable projected gradient descent algorithm based on truncated singular value decomposition. Exploiting the fact that the pairwise loss depends only on differences of linear predictors, we simplify computation and analysis. We establish non-asymptotic convergence guarantees, showing that, under suitable regularity conditions, the algorithm converges linearly up to the minimax-optimal statistical precision. Extensive simulation studies demonstrate that the proposed method is robust in challenging settings such as label switching and data contamination and consistently outperforms likelihood-based approaches.
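
The procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say which AUC surrogate is used, so the logistic surrogate log(1 + exp(-d)) is assumed here. The function names, the zero initialization, the fixed step size, and the iteration count are likewise hypothetical choices made for illustration. Each response k contributes the average surrogate loss over its (positive, negative) pairs, evaluated at the difference d of linear predictors, and each gradient step is followed by a truncated SVD that projects the coefficient matrix back onto the set of matrices with rank at most r.

```python
import numpy as np


def truncated_svd_projection(B, rank):
    """Return the best rank-`rank` approximation of B (Frobenius norm)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def pairwise_auc_loss_and_grad(X, Y, B):
    """Pairwise logistic AUC surrogate, averaged over responses, and its gradient.

    X: (n, p) features; Y: (n, q) labels in {0, 1}; B: (p, q) coefficients.
    The logistic surrogate is an assumption of this sketch. Responses that
    lack either class contribute nothing to the loss or the gradient.
    """
    q = Y.shape[1]
    scores = X @ B  # linear predictors, shape (n, q)
    loss = 0.0
    grad = np.zeros_like(B, dtype=float)
    for k in range(q):
        pos = np.flatnonzero(Y[:, k] == 1)
        neg = np.flatnonzero(Y[:, k] == 0)
        if pos.size == 0 or neg.size == 0:
            continue
        # The loss depends only on differences of linear predictors.
        d = scores[pos, k][:, None] - scores[neg, k][None, :]
        loss += np.mean(np.logaddexp(0.0, -d))  # log(1 + exp(-d)), stable
        # d/dd log(1 + exp(-d)) = -1 / (1 + exp(d)), written via tanh for stability.
        w = -0.5 * (1.0 - np.tanh(0.5 * d)) / d.size
        # sum_{i,j} w_ij (x_i - x_j), split into positive and negative parts.
        grad[:, k] = X[pos].T @ w.sum(axis=1) - X[neg].T @ w.sum(axis=0)
    return loss / q, grad / q


def fit_low_rank_auc(X, Y, rank, step=0.5, n_iter=200):
    """Projected gradient descent: gradient step, then rank-`rank` truncation."""
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        _, grad = pairwise_auc_loss_and_grad(X, Y, B)
        B = truncated_svd_projection(B - step * grad, rank)
    return B


if __name__ == "__main__":
    # Synthetic data with a rank-2 coefficient matrix (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 6))
    B_true = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
    Y = (X @ B_true + rng.logistic(size=(200, 5)) > 0).astype(int)
    B_hat = fit_low_rank_auc(X, Y, rank=2)
    print("surrogate loss:", pairwise_auc_loss_and_grad(X, Y, B_hat)[0])
```

Forming all positive-negative pairs costs O(n_pos × n_neg) memory per response, so for large samples this sketch would need pair subsampling or a sorted-score formulation. The fixed step size is a convenience of the example and is not a value taken from the paper.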
