Joseph Anderson

LG
h-index47
5papers
108citations
Novelty64%
AI Score31

5 Papers

CVOct 13, 2024
Improving Colorectal Cancer Screening and Risk Assessment through Predictive Modeling on Medical Images and Records

Shuai Jiang, Christina Robinson, Joseph Anderson et al.

Colonoscopy screening effectively identifies and removes polyps before they progress to colorectal cancer (CRC), but current follow-up guidelines rely primarily on histopathological features, overlooking other important CRC risk factors. Variability in polyp characterization among pathologists also hinders consistent surveillance decisions. Advances in digital pathology and deep learning enable the integration of pathology slides and medical records for more accurate CRC risk prediction. Using data from the New Hampshire Colonoscopy Registry, including longitudinal follow-up, we adapted a transformer-based model for histopathology image analysis to predict 5-year CRC risk. We further explored multi-modal fusion strategies to combine clinical records with deep learning-derived image features. Training the model to predict intermediate clinical variables improved 5-year CRC risk prediction (AUC = 0.630) compared to direct prediction (AUC = 0.615, p = 0.013). Incorporating both imaging and non-imaging data, without requiring manual slide review, further improved performance (AUC = 0.674) compared to traditional features from colonoscopy and microscopy reports (AUC = 0.655, p = 0.001). These results highlight the value of integrating diverse data modalities with computational methods to enhance CRC risk stratification.

LGMay 25, 2017
Geometric Methods for Robust Data Analysis in High Dimension

Joseph Anderson

Machine learning and data analysis now finds both scientific and industrial application in biology, chemistry, geology, medicine, and physics. These applications rely on large quantities of data gathered from automated sensors and user input. Furthermore, the dimensionality of many datasets is extreme: more details are being gathered about single user interactions or sensor readings. All of these applications encounter problems with a common theme: use observed data to make inferences about the world. Our work obtains the first provably efficient algorithms for Independent Component Analysis (ICA) in the presence of heavy-tailed data. The main tool in this result is the centroid body (a well-known topic in convex geometry), along with optimization and random walks for sampling from a convex body. This is the first algorithmic use of the centroid body and it is of independent theoretical interest, since it effectively replaces the estimation of covariance from samples, and is more generally accessible. This reduction relies on a non-linear transformation of samples from such an intersection of halfspaces (i.e. a simplex) to samples which are approximately from a linearly transformed product distribution. Through this transformation of samples, which can be done efficiently, one can then use an ICA algorithm to recover the vertices of the intersection of halfspaces. Finally, we again use ICA as an algorithmic primitive to construct an efficient solution to the widely-studied problem of learning the parameters of a Gaussian mixture model. Our algorithm again transforms samples from a Gaussian mixture model into samples which fit into the ICA model and, when processed by an ICA algorithm, result in recovery of the mixture parameters. Our algorithm is effective even when the number of Gaussians in the mixture grows polynomially with the ambient dimension

LGFeb 22, 2017
Heavy-Tailed Analogues of the Covariance Matrix for ICA

Joseph Anderson, Navin Goyal, Anupama Nandi et al.

Independent Component Analysis (ICA) is the problem of learning a square matrix $A$, given samples of $X=AS$, where $S$ is a random vector with independent coordinates. Most existing algorithms are provably efficient only when each $S_i$ has finite and moderately valued fourth moment. However, there are practical applications where this assumption need not be true, such as speech and finance. Algorithms have been proposed for heavy-tailed ICA, but they are not practical, using random walks and the full power of the ellipsoid algorithm multiple times. The main contributions of this paper are: (1) A practical algorithm for heavy-tailed ICA that we call HTICA. We provide theoretical guarantees and show that it outperforms other algorithms in some heavy-tailed regimes, both on real and synthetic data. Like the current state-of-the-art, the new algorithm is based on the centroid body (a first moment analogue of the covariance matrix). Unlike the state-of-the-art, our algorithm is practically efficient. To achieve this, we use explicit analytic representations of the centroid body, which bypasses the use of the ellipsoid method and random walks. (2) We study how heavy tails affect different ICA algorithms, including HTICA. Somewhat surprisingly, we show that some algorithms that use the covariance matrix or higher moments can successfully solve a range of ICA instances with infinite second moment. We study this theoretically and experimentally, with both synthetic and real-world heavy-tailed data.

LGSep 2, 2015
Heavy-tailed Independent Component Analysis

Joseph Anderson, Navin Goyal, Anupama Nandi et al.

Independent component analysis (ICA) is the problem of efficiently recovering a matrix $A \in \mathbb{R}^{n\times n}$ from i.i.d. observations of $X=AS$ where $S \in \mathbb{R}^n$ is a random vector with mutually independent coordinates. This problem has been intensively studied, but all existing efficient algorithms with provable guarantees require that the coordinates $S_i$ have finite fourth moments. We consider the heavy-tailed ICA problem where we do not make this assumption, about the second moment. This problem also has received considerable attention in the applied literature. In the present work, we first give a provably efficient algorithm that works under the assumption that for constant $γ> 0$, each $S_i$ has finite $(1+γ)$-moment, thus substantially weakening the moment requirement condition for the ICA problem to be solvable. We then give an algorithm that works under the assumption that matrix $A$ has orthogonal columns but requires no moment assumptions. Our techniques draw ideas from convex geometry and exploit standard properties of the multivariate spherical Gaussian distribution in a novel way.

LGNov 12, 2013
The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

Joseph Anderson, Mikhail Belkin, Navin Goyal et al.

In this paper we show that very large mixtures of Gaussians are efficiently learnable in high dimension. More precisely, we prove that a mixture with known identical covariance matrices whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can possibly exist in low dimension and the problem of learning the parameters is generically hard. In contrast, much of the existing work on Gaussian Mixtures relies on low-dimensional projections and thus hits an artificial barrier. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussians to a linear map of a product distribution. The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture. In addition, we combine our low-dimensional hardness results for Gaussian mixtures with Poissonization to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined ICA in low dimension. To the best of our knowledge, this is the first such result in the literature. In addition to contributing to the problem of Gaussian mixture learning, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.