ML LGNov 25, 2022

Doubly robust nearest neighbors in factor models

Raaz Dwivedi, Katherine Tian, Sabina Tomkins, Predrag Klasnja, Susan Murphy, Devavrat Shah

HarvardMIT

arXiv:2211.14297v311.611 citationsh-index: 63

Originality Incremental advance

AI Analysis

This work addresses a specific bottleneck in matrix completion for researchers and practitioners, offering an incremental improvement over prior NN strategies.

The paper tackles the problem of poor performance in nearest neighbors (NN) estimation for missing data in latent factor models when similar rows or columns are unavailable, by introducing a doubly robust NN variant that provides consistent estimates if either good row or column neighbors exist and offers a near-quadratic improvement in error and narrower confidence intervals when both exist.

We introduce and analyze an improved variant of nearest neighbors (NN) for estimation with missing data in latent factor models. We consider a matrix completion problem with missing data, where the $(i, t)$-th entry, when observed, is given by its mean $f(u_i, v_t)$ plus mean-zero noise for an unknown function $f$ and latent factors $u_i$ and $v_t$. Prior NN strategies, like unit-unit NN, for estimating the mean $f(u_i, v_t)$ relies on existence of other rows $j$ with $u_j \approx u_i$. Similarly, time-time NN strategy relies on existence of columns $t'$ with $v_{t'} \approx v_t$. These strategies provide poor performance respectively when similar rows or similar columns are not available. Our estimate is doubly robust to this deficit in two ways: (1) As long as there exist either good row or good column neighbors, our estimate provides a consistent estimate. (2) Furthermore, if both good row and good column neighbors exist, it provides a (near-)quadratic improvement in the non-asymptotic error and admits a significantly narrower asymptotic confidence interval when compared to both unit-unit or time-time NN.

View on arXiv PDF

Similar