STJan 12, 2023
Detection problems in the spiked matrix modelsJi Hyung Jung, Hye Won Chung, Ji Oon Lee
We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals. As an intermediate step, we find out sharp phase transition thresholds for the extreme eigenvalues of spiked random matrices, which generalize the Baik-Ben Arous-Péché (BBP) transition. We also prove the central limit theorem for the linear spectral statistics for the spiked random matrices and propose a hypothesis test based on it, which does not depend on the distribution of the signal or the noise. When the noise is non-Gaussian noise, the test can be improved with an entrywise transformation to the data matrix with additive noise. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.
STApr 28, 2021
Detection of Signal in the Spiked Rectangular ModelsJi Hyung Jung, Hye Won Chung, Ji Oon Lee
We consider the problem of detecting signals in the rank-one signal-plus-noise data matrix models that generalize the spiked Wishart matrices. We show that the principal component analysis can be improved by pre-transforming the matrix entries if the noise is non-Gaussian. As an intermediate step, we prove a sharp phase transition of the largest eigenvalues of spiked rectangular matrices, which extends the Baik-Ben Arous-Péché (BBP) transition. We also propose a hypothesis test to detect the presence of signal with low computational complexity, based on the linear spectral statistics, which minimizes the sum of the Type-I and Type-II errors when the noise is Gaussian.
STJan 16, 2020
Weak Detection in the Spiked Wigner Model with General RankJi Hyung Jung, Hye Won Chung, Ji Oon Lee
We study the statistical decision process of detecting the signal from a `signal+noise' type matrix model with an additive Wigner noise. We propose a hypothesis test based on the linear spectral statistics of the data matrix, which does not depend on the distribution of the signal or the noise. The test is optimal under the Gaussian noise if the signal-to-noise ratio is small, as it minimizes the sum of the Type-I and Type-II errors. Under the non-Gaussian noise, the test can be improved with an entrywise transformation to the data matrix. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.
STSep 28, 2018
Weak detection in the spiked Wigner modelHye Won Chung, Ji Oon Lee
We consider the weak detection problem in a rank-one spiked Wigner data matrix where the signal-to-noise ratio is small so that reliable detection is impossible. We propose a hypothesis test on the presence of the signal by utilizing the linear spectral statistics of the data matrix. The test is data-driven and does not require prior knowledge about the distribution of the signal or the noise. When the noise is Gaussian, the proposed test is optimal in the sense that its error matches that of the likelihood ratio test, which minimizes the sum of the Type-I and Type-II errors. If the density of the noise is known and non-Gaussian, the error of the test can be lowered by applying an entrywise transformation to the data matrix. We establish a central limit theorem for the linear spectral statistics of general rank-one spiked Wigner matrices as an intermediate step.
ITSep 4, 2018
Parity Queries for Binary ClassificationHye Won Chung, Ji Oon Lee, Doyeon Kim et al.
Consider a query-based data acquisition problem that aims to recover the values of $k$ binary variables from parity (XOR) measurements of chosen subsets of the variables. Assume the response model where only a randomly selected subset of the measurements is received. We propose a method for designing a sequence of queries so that the variables can be identified with high probability using as few ($n$) measurements as possible. We define the query difficulty $\bar{d}$ as the average size of the query subsets and the sample complexity $n$ as the minimum number of measurements required to attain a given recovery accuracy. We obtain fundamental trade-offs between recovery accuracy, query difficulty, and sample complexity. In particular, the necessary and sufficient sample complexity required for recovering all $k$ variables with high probability is $n = c_0 \max\{k, (k \log k)/\bar{d}\}$ and the sample complexity for recovering a fixed proportion $(1-δ)k$ of the variables for $δ=o(1)$ is $n = c_1\max\{k, (k \log(1/δ))/\bar{d}\}$, where $c_0, c_1>0$.