LG DS MLAug 12, 2016

Chi-squared Amplification: Identifying Hidden Hubs

arXiv:1608.03643v23.53 citations

Originality Highly original

AI Analysis

This addresses the problem of efficiently detecting hidden structures in random matrices for applications in data analysis and machine learning, representing a theoretical advance beyond known barriers.

The paper tackles the hidden hubs model, a generalization of the planted Gaussian submatrix problem, by developing a polynomial-time algorithm that identifies hidden hubs with high probability for k ≥ n^{0.5-δ} when σ₁² > 2σ₀², and shows a nearly matching lower bound for σ₁² ≤ 2σ₀², with a critical improvement to k ≥ c√n(ln n)^{1/4} at σ₁² = 2σ₀².

We consider the following general hidden hubs model: an $n \times n$ random matrix $A$ with a subset $S$ of $k$ special rows (hubs): entries in rows outside $S$ are generated from the probability distribution $p_0 \sim N(0,σ_0^2)$; for each row in $S$, some $k$ of its entries are generated from $p_1 \sim N(0,σ_1^2)$, $σ_1>σ_0$, and the rest of the entries from $p_0$. The problem is to identify the high-degree hubs efficiently. This model includes and significantly generalizes the planted Gaussian Submatrix Model, where the special entries are all in a $k \times k$ submatrix. There are two well-known barriers: if $k\geq c\sqrt{n\ln n}$, just the row sums are sufficient to find $S$ in the general model. For the submatrix problem, this can be improved by a $\sqrt{\ln n}$ factor to $k \ge c\sqrt{n}$ by spectral methods or combinatorial methods. In the variant with $p_0=\pm 1$ (with probability $1/2$ each) and $p_1\equiv 1$, neither barrier has been broken. We give a polynomial-time algorithm to identify all the hidden hubs with high probability for $k \ge n^{0.5-δ}$ for some $δ>0$, when $σ_1^2>2σ_0^2$. The algorithm extends to the setting where planted entries might have different variances each at least as large as $σ_1^2$. We also show a nearly matching lower bound: for $σ_1^2 \le 2σ_0^2$, there is no polynomial-time Statistical Query algorithm for distinguishing between a matrix whose entries are all from $N(0,σ_0^2)$ and a matrix with $k=n^{0.5-δ}$ hidden hubs for any $δ>0$. The lower bound as well as the algorithm are related to whether the chi-squared distance of the two distributions diverges. At the critical value $σ_1^2=2σ_0^2$, we show that the general hidden hubs problem can be solved for $k\geq c\sqrt n(\ln n)^{1/4}$, improving on the naive row sum-based method.

View on arXiv PDF

Similar