Distance Shrinkage and Euclidean Embedding via Regularized Kernel Estimation
This work addresses a problem that arises often in practical data analysis, for example when visualizing protein sequence diversity, but it is incremental in that it builds upon existing distance estimation methods.
The paper tackles the problem of recovering Euclidean distance matrices from noisy observations by proposing a regularized kernel estimate that applies constant shrinkage to all observed pairwise distances, achieving consistent estimation of true distances as the number of objects increases.
Although recovering a Euclidean distance matrix from noisy observations is a common problem in practice, how well this can be done remains largely unknown. To fill this void, we study a simple distance matrix estimate based upon the so-called regularized kernel estimate. We show that such an estimate can be characterized as simply applying a constant amount of shrinkage to all observed pairwise distances. This fact allows us to establish risk bounds for the estimate, implying that the true distances can be estimated consistently in an average sense as the number of objects increases. In addition, such a characterization suggests an efficient algorithm to compute the distance matrix estimator, as an alternative to the usual second-order cone programming, which is known not to scale well for large problems. Numerical experiments and an application in visualizing the diversity of Vpu protein sequences from a recent HIV-1 study further demonstrate the practical merits of the proposed method.
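To make the idea of "constant shrinkage followed by a Euclidean embedding" concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's algorithm: the function names, the shrinkage amount `eta`, the clipping at zero, the choice to treat matrix entries as squared distances, and the use of classical multidimensional scaling for the embedding step are all assumptions made for illustration only.

```python
import numpy as np


def shrink_distances(D, eta):
    """Subtract a constant eta from every entry, clip at zero, zero the diagonal.

    Hypothetical helper: eta and the clipping rule are illustrative assumptions.
    """
    S = np.maximum(np.asarray(D, dtype=float) - eta, 0.0)
    np.fill_diagonal(S, 0.0)
    return S


def classical_mds(D_sq, dim=2):
    """Embed n objects in `dim` dimensions from a matrix of squared distances.

    Standard classical MDS: double-center, eigendecompose, keep the top
    `dim` eigenpairs, and clip negative eigenvalues to zero.
    """
    n = D_sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    K = -0.5 * J @ D_sq @ J                   # Gram (kernel) matrix
    w, V = np.linalg.eigh(K)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]           # indices of the largest ones
    w_top = np.clip(w[idx], 0.0, None)
    return V[:, idx] * np.sqrt(w_top)


def pairwise_sq_dists(X):
    """Squared Euclidean distances between the rows of X."""
    G = X @ X.T
    sq = np.diag(G)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_true = rng.normal(size=(30, 2))
    D_true = pairwise_sq_dists(X_true)

    # Symmetric, nonnegative noise on the off-diagonal entries (illustrative).
    E = np.abs(rng.normal(scale=0.5, size=D_true.shape))
    E = (E + E.T) / 2.0
    np.fill_diagonal(E, 0.0)
    D_obs = D_true + E

    D_shrunk = shrink_distances(D_obs, eta=0.4)
    X_hat = classical_mds(D_shrunk, dim=2)
    D_hat = pairwise_sq_dists(X_hat)

    # Average squared error over all pairs, before and after the two steps.
    print("observed :", np.mean((D_obs - D_true) ** 2))
    print("estimated:", np.mean((D_hat - D_true) ** 2))
```

Running the script prints the average squared error of the observed and the estimated distances against the simulated truth. In this sketch `eta` is fixed by hand; how the shrinkage amount should be chosen is a matter for the paper itself and is not addressed here.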