Van Vu

IV
h-index: 1
7 papers
252 citations
Novelty: 53%
AI Score: 45

7 Papers

NA · Apr 12, 2010
Singular vectors under random perturbation

Van Vu

Computing the first few singular vectors of a large matrix is a problem that frequently comes up in statistics and numerical analysis. Given the presence of noise, exact calculation is hard to achieve, and the following question is of importance: how much does a small perturbation to the matrix change the singular vectors? Answering this question, classical theorems, such as those of Davis-Kahan and Wedin, give tight estimates for the worst-case scenario. In this paper, we show that if the perturbation (noise) is random and our matrix has low rank, then better estimates can be obtained. Our method relies on high-dimensional geometry and is different from those used in earlier papers.

IV · Mar 20, 2022
VinDr-Mammo: A large-scale benchmark dataset for computer-aided diagnosis in full-field digital mammography

Hieu T. Nguyen, Ha Q. Nguyen, Hieu H. Pham et al.

Mammography, or breast X-ray, is the most widely used imaging modality to detect cancer and other breast diseases. Recent studies have shown that deep learning-based computer-assisted detection and diagnosis (CADe or CADx) tools can support physicians and improve the accuracy of interpreting mammography. However, most published mammography datasets are either limited in sample size or digitized from screen-film mammography (SFM), hindering the development of CADe and CADx tools, which are built on full-field digital mammography (FFDM). To overcome this challenge, we introduce VinDr-Mammo, a new benchmark dataset of FFDM for detecting and diagnosing breast cancer and other diseases in mammography. The dataset consists of 5,000 mammography exams, each of which has four standard views and is double read, with disagreement (if any) resolved by arbitration. It is created for the assessment of Breast Imaging Reporting and Data System (BI-RADS) and density at the breast level. In addition, the dataset provides the category, location, and BI-RADS assessment of non-benign findings. We make VinDr-Mammo publicly available on PhysioNet as a new imaging resource to promote advances in developing CADe and CADx tools for breast cancer screening.

NA · Mar 16
New perturbation bounds for low rank approximation of matrices: Beyond Eckart-Young-Mirsky

Phuc Tran, Van Vu

Let $A$ be an $m \times n$ matrix with rank $r$ and singular value decomposition $A = \sum_{i=1}^r σ_i u_i v_i^\top,$ where $σ_i$ are its singular values, ordered decreasingly, and $u_i, v_i$ are the corresponding left and right singular vectors. For a parameter $1 \le p \le r$, $A_p := \sum_{i=1}^p σ_i u_i v_i^\top$ is the best rank-$p$ approximation of $A$. In practice, one often chooses $p$ to be small, leading to the commonly used phrase "low-rank approximation". Low-rank approximation plays a central role in data science because it can substantially reduce the dimensionality of the original data, the matrix $A$. For a large data matrix $A$, one typically computes a rank-$p$ approximation $A_p$ for a suitably chosen small $p$, stores $A_p$, and uses it as input for further computations. The reduced dimension of $A_p$ enables faster computations and significant data compression. In practice, noise is inevitable. We often have access only to noisy data $\tilde A = A + E$, where $E$ represents the noise. Consequently, the low-rank approximation used as input in many downstream tasks is $\tilde A_p$, the best rank-$p$ approximation of $\tilde A$, rather than $A_p$. Therefore, it is natural and important to estimate the error $\| \tilde A_p - A_p \|$. This error plays a critical role in estimating the accuracy of the output of any process involving a low-rank approximation of noisy input. In this paper, we develop a new method (based on contour analysis) to bound $\| \tilde A_p - A_p \|$. With this method, we can exploit new parameters that measure the skewness between the noise matrix $E$ and the singular vectors of $A$, avoiding the worst-case analysis used in traditional approaches. In many settings, we obtain notable quantitative improvements compared to classical approaches (using the Eckart-Young-Mirsky theorem or the Davis-Kahan theorem).
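To make the quantity $\| \tilde A_p - A_p \|$ concrete, the following is a minimal numerical sketch, not the paper's method. It assumes NumPy, a synthetic rank-5 matrix with hand-picked singular values, and i.i.d. Gaussian noise; the helper names `best_rank_p` and `low_rank_error` are hypothetical and chosen only for illustration.

```python
import numpy as np


def best_rank_p(M, p):
    """Best rank-p approximation of M, obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Scale the first p left singular vectors by their singular values.
    return (U[:, :p] * s[:p]) @ Vt[:p, :]


def low_rank_error(A, E, p):
    """Spectral norm of (A + E)_p - A_p."""
    return np.linalg.norm(best_rank_p(A + E, p) - best_rank_p(A, p), ord=2)


rng = np.random.default_rng(0)
m, n, r, p = 60, 40, 5, 2

# Assumed test signal: rank r with well-separated singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
sigma = np.array([50.0, 40.0, 30.0, 20.0, 10.0])
A = (U * sigma) @ V.T

# Assumed noise model: small i.i.d. Gaussian entries.
E = 0.1 * rng.standard_normal((m, n))

err = low_rank_error(A, E, p)
print(f"||A~_p - A_p|| = {err:.4f}, ||E|| = {np.linalg.norm(E, ord=2):.4f}")
```

Running this for several noise levels or values of `p` gives an empirical feel for how the error depends on the gap between $σ_p$ and $σ_{p+1}$, which is the kind of dependence the perturbation bounds quantify.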

NA · Mar 20
Eigenvalue Stability and New Perturbation Bounds for the extremal eigenvalues of a matrix

Phuc Tran, Van Vu

Let $A$ be a full-rank $n\times n$ matrix, with singular values $σ_1 (A) \ge \dots \ge σ_n (A) >0$. The condition number $κ(A):= σ_1(A)/σ_n(A)=\|A\|\cdot \|A^{-1}\|$ is a key parameter in the analysis of algorithms taking $A$ as input. In practice, matrices (representing real data) are often perturbed by noise. Technically speaking, the real input would be a noisy variant $\tilde A =A +E$ of $A$, where $E$ represents the noise. The condition number $κ(\tilde A)$ will be used instead of $κ(A)$. Thus, it is important to measure the impact of noise on the condition number. In this paper, we focus on the case when the noise is random. We introduce the notion of regional stability, via which we design a new framework to estimate the perturbation of the extremal singular values and the condition number of a matrix. Our framework allows us to bound the perturbation of singular values through the perturbation of singular spaces. We then bound the latter using a novel contour analysis argument, which, as a by-product, provides an improved version of the classical Davis-Kahan theorem in many settings. Our new estimates concerning the least singular value $σ_n(A)$ complement well-known results in this area, and are more favorable in the case when the ground matrix $A$ is large compared to the noise matrix $E$.
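As a small illustration of the quantities involved, the sketch below computes the condition number as $σ_1/σ_n$, compares it with the condition number of a noisy copy, and checks the classical worst-case bound from Weyl's inequality (each singular value moves by at most $\|E\|$). It assumes NumPy, a synthetic well-conditioned matrix, and Gaussian noise; it does not implement the regional stability framework of the paper.

```python
import numpy as np


def condition_number(M):
    """kappa(M) = sigma_1(M) / sigma_n(M) for a full-rank square matrix."""
    s = np.linalg.svd(M, compute_uv=False)  # sorted in decreasing order
    return s[0] / s[-1]


rng = np.random.default_rng(1)
n = 50

# Assumed ground matrix: a large multiple of the identity plus a small
# random part, so that it is comfortably invertible.
A = 10.0 * np.eye(n) + rng.standard_normal((n, n)) / np.sqrt(n)

# Assumed noise model: small i.i.d. Gaussian entries.
E = 0.01 * rng.standard_normal((n, n))

kappa = condition_number(A)
kappa_noisy = condition_number(A + E)

# Largest change of any singular value, to compare with ||E|| (Weyl).
s_clean = np.linalg.svd(A, compute_uv=False)
s_noisy = np.linalg.svd(A + E, compute_uv=False)
max_shift = np.abs(s_noisy - s_clean).max()
noise_norm = np.linalg.norm(E, ord=2)

print(f"kappa(A) = {kappa:.4f}, kappa(A + E) = {kappa_noisy:.4f}")
print(f"max singular value shift = {max_shift:.4f} <= ||E|| = {noise_norm:.4f}")
```

In experiments of this kind the observed shift under random noise is often well below $\|E\|$, which is the sort of gap between worst-case and random-noise behavior that the abstract refers to.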

ST · Jan 31, 2025
Fast exact recovery of noisy matrix from few entries: the infinity norm approach

BaoLinh Tran, Van Vu

The matrix recovery (completion) problem, a central problem in data science and theoretical computer science, is to recover a matrix $A$ from a relatively small sample of entries. While such a task is impossible in general, it has been shown that one can recover $A$ exactly in polynomial time, with high probability, from a random subset of entries, under three (basic and necessary) assumptions: (1) the rank of $A$ is very small compared to its dimensions (low rank), (2) $A$ has delocalized singular vectors (incoherence), and (3) the sample size is sufficiently large. There are many different algorithms for the task, including convex optimization by Candes, Tao and Recht (2009), alternating projection by Hardt and Wootters (2014) and low rank approximation with gradient descent by Keshavan, Montanari and Oh (2009, 2010). In applications, it is more realistic to assume that data is noisy. In this case, these approaches provide an approximate recovery with small root mean square error. However, it is hard to transform such an approximate recovery into an exact one. Recently, results by Abbe et al. (2017) and Bhardwaj et al. (2023) concerning approximation in the infinity norm showed that we can achieve exact recovery even in the noisy case, given that the ground matrix has bounded precision. Beyond the three basic assumptions above, they required either that the condition number of $A$ be small (Abbe et al.) or that the gap between consecutive singular values be large (Bhardwaj et al.). In this paper, we remove these extra spectral assumptions. As a result, we obtain a simple algorithm for exact recovery in the noisy case, under only the three basic assumptions. This is the first such algorithm. To analyse this algorithm, we introduce a contour integration argument which is totally different from all previous methods and may be of independent interest.
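The role of the infinity norm can be seen in a short sketch. If every entry of $A$ is an integer multiple of a known precision `eps` (one reading of "bounded precision", assumed here), and an estimate is within `eps / 2` of $A$ in every entry, then rounding each entry to the nearest multiple of `eps` returns $A$ exactly. The code below shows only this final rounding step on a synthetic estimate; it is not the recovery algorithm of the paper, and `round_to_precision` is a hypothetical helper name.

```python
import numpy as np


def round_to_precision(A_hat, eps):
    """Snap every entry of A_hat to the nearest integer multiple of eps."""
    return np.round(A_hat / eps) * eps


rng = np.random.default_rng(2)
eps = 0.5  # assumed precision of the ground matrix

# Ground matrix whose entries are integer multiples of eps.
A = eps * rng.integers(-10, 11, size=(6, 8))

# Stand-in for the output of an approximate recovery method:
# entrywise error 0.2, which is below eps / 2 = 0.25.
A_hat = A + rng.uniform(-0.2, 0.2, size=A.shape)

recovered = round_to_precision(A_hat, eps)
print("max entrywise error before rounding:", np.abs(A_hat - A).max())
print("exact recovery after rounding:", np.array_equal(recovered, A))
```

A small root mean square error does not give the same guarantee, since a few entries may still be off by more than `eps / 2`; this is why the entrywise (infinity norm) control matters for exact recovery.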

IV · Jun 24, 2021
VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs

Hieu T. Nguyen, Hieu H. Pham, Nghia T. Nguyen et al.

Radiographs are the most important imaging tool for identifying spine anomalies in clinical practice. The evaluation of spinal bone lesions, however, is a challenging task for radiologists. This work aims at developing and evaluating a deep learning-based framework, named VinDr-SpineXR, for the classification and localization of abnormalities from spine X-rays. First, we build a large dataset, comprising 10,468 spine X-ray images from 5,000 studies, each of which is manually annotated by an experienced radiologist with bounding boxes around abnormal findings in 13 categories. Using this dataset, we then train a deep learning classifier to determine whether a spine scan is abnormal and a detector to localize 7 crucial findings among the 13. VinDr-SpineXR is evaluated on a test set of 2,078 images from 1,000 studies, which is kept separate from the training set. It demonstrates an area under the receiver operating characteristic curve (AUROC) of 88.61% (95% CI 87.19%, 90.02%) for the image-level classification task and a mean average precision (mAP@0.5) of 33.56% for the lesion-level localization task. These results serve as a proof of concept and set a baseline for future research in this direction. To encourage advances, the dataset, code, and trained deep learning models are made publicly available.

ML · Mar 2, 2018
Matrices with Gaussian noise: optimal estimates for singular subspace perturbation

Sean O'Rourke, Van Vu, Ke Wang

The Davis-Kahan-Wedin $\sin Θ$ theorem describes how the singular subspaces of a matrix change when subjected to a small perturbation. This classic result is sharp in the worst-case scenario. In this paper, we prove a stochastic version of the Davis-Kahan-Wedin $\sin Θ$ theorem when the perturbation is a Gaussian random matrix. Under certain structural assumptions, we obtain an optimal bound that significantly improves upon the classic Davis-Kahan-Wedin $\sin Θ$ theorem. One of our key tools is a new perturbation bound for the singular values, which may be of independent interest.
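For readers who want to experiment with the $\sin Θ$ distance, here is a minimal sketch assuming NumPy, a synthetic rank-3 matrix, and Gaussian noise. It uses the standard fact that the singular values of $U^\top \tilde U$ are the cosines of the principal angles between the two subspaces; the helper names `sin_theta_distance` and `top_left_singular_vectors` are hypothetical. It measures the distance only and does not reproduce the bound proved in the paper.

```python
import numpy as np


def sin_theta_distance(U, U_tilde):
    """Spectral norm of sin(Theta) between the column spaces of U and U_tilde.

    Both inputs must have orthonormal columns and the same shape.
    """
    cosines = np.linalg.svd(U.T @ U_tilde, compute_uv=False)
    # The largest principal angle corresponds to the smallest cosine.
    cos_min = np.clip(cosines.min(), 0.0, 1.0)
    return np.sqrt(1.0 - cos_min**2)


def top_left_singular_vectors(M, k):
    """Orthonormal basis of the span of the top-k left singular vectors."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]


rng = np.random.default_rng(3)
n, r, k = 80, 3, 2

# Assumed low-rank signal with singular values 30, 20, 10.
Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, r)))
A = (Q1 * np.array([30.0, 20.0, 10.0])) @ Q2.T

# Gaussian perturbation with a small, arbitrarily chosen scale.
E = 0.05 * rng.standard_normal((n, n))

U = top_left_singular_vectors(A, k)
U_tilde = top_left_singular_vectors(A + E, k)
dist = sin_theta_distance(U, U_tilde)
print(f"||sin Theta|| = {dist:.4f}, ||E|| = {np.linalg.norm(E, ord=2):.4f}")
```

For subspaces of equal dimension the same value equals the spectral norm of the difference of the orthogonal projections, $\| U U^\top - \tilde U \tilde U^\top \|$, which gives a convenient cross-check.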