Phase Transitions in the Detection of Correlated Databases
This work addresses a fundamental statistical problem relevant to fields like social media and computational biology, providing rigorous theoretical limits for correlation detection with unknown permutations.
The paper tackles the problem of detecting correlation between two Gaussian databases with unknown row permutations, determining sharp thresholds for optimal testing that exhibit phase transitions depending on the asymptotic regimes of n and d. It proves that weak detection is impossible when ρ²d→0, regardless of n, and strong detection is impossible for ρ<ρ* when d is fixed, closing gaps in recent studies.
We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation $σ$ over the set of $n$ users (or, row permutation), such that $\mathsf{X}$ is $ρ$-correlated with $\mathsf{Y}^σ$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $ρ^2d\to0$, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of $n$. This compliments the performance of a simple test that thresholds the sum all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $ρ<ρ^\star$, where $ρ^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $ρ^2d\to0$. These results close significant gaps in current recent related studies.