Optimal Estimator for Linear Regression with Shuffled Labels
This work addresses a specific linear regression challenge with shuffled labels, offering incremental improvements in estimation methods for this domain.
The paper tackles the problem of linear regression with shuffled labels by proposing a one-step estimator to reconstruct the unknown permutation and signal matrix, achieving computational complexity of O(n^3 + np^2m) and providing sufficient conditions for correct permutation recovery across different signal-to-noise ratio regimes, such as snr ≥ Ω(1) in the easy regime.
This paper considers the task of linear regression with shuffled labels, i.e., $\mathbf Y = \mathbf Π\mathbf X \mathbf B + \mathbf W$, where $\mathbf Y \in \mathbb R^{n\times m}, \mathbf Pi \in \mathbb R^{n\times n}, \mathbf X\in \mathbb R^{n\times p}, \mathbf B \in \mathbb R^{p\times m}$, and $\mathbf W\in \mathbb R^{n\times m}$, respectively, represent the sensing results, (unknown or missing) corresponding information, sensing matrix, signal of interest, and additive sensing noise. Given the observation $\mathbf Y$ and sensing matrix $\mathbf X$, we propose a one-step estimator to reconstruct $(\mathbf Π, \mathbf B)$. From the computational perspective, our estimator's complexity is $O(n^3 + np^2m)$, which is no greater than the maximum complexity of a linear assignment algorithm (e.g., $O(n^3)$) and a least square algorithm (e.g., $O(np^2 m)$). From the statistical perspective, we divide the minimum $snr$ requirement into four regimes, e.g., unknown, hard, medium, and easy regimes; and present sufficient conditions for the correct permutation recovery under each regime: $(i)$ $snr \geq Ω(1)$ in the easy regime; $(ii)$ $snr \geq Ω(\log n)$ in the medium regime; and $(iii)$ $snr \geq Ω((\log n)^{c_0}\cdot n^{{c_1}/{srank(\mathbf B)}})$ in the hard regime ($c_0, c_1$ are some positive constants and $srank(\mathbf B)$ denotes the stable rank of $\mathbf B$). In the end, we also provide numerical experiments to confirm the above claims.