MENov 13, 2022
Elliptically-Contoured Tensor-variate Distributions with Application to Improved Image LearningCarlos Llosa-Vite, Ranjan Maitra
Statistical analysis of tensor-valued data has largely used the tensor-variate normal (TVN) distribution that may be inadequate when data comes from distributions with heavier or lighter tails. We study a general family of elliptically contoured (EC) tensor-variate distributions and derive its characterizations, moments, marginal and conditional distributions, and the EC Wishart distribution. We describe procedures for maximum likelihood estimation from data that are (1) uncorrelated draws from an EC distribution, (2) from a scale mixture of the TVN distribution, and (3) from an underlying but unknown EC distribution, where we extend Tyler's robust estimator. A detailed simulation study highlights the benefits of choosing an EC distribution over the TVN for heavier-tailed data. We develop tensor-variate classification rules using discriminant analysis and EC errors and show that they better predict cats and dogs from images in the Animal Faces-HQ dataset than the TVN-based rules. A novel tensor-on-tensor regression and tensor-variate analysis of variance (TANOVA) framework under EC errors is also demonstrated to better characterize gender, age and ethnic origin than the usual TVN-based TANOVA in the celebrated Labeled Faces of the Wild dataset.
MEMar 30, 2021Code
Fast model-based clustering of partial recordsEmily M. Goren, Ranjan Maitra
Partially recorded data are frequently encountered in many applications and usually clustered by first removing incomplete cases or features with missing values, or by imputing missing values, followed by application of a clustering algorithm to the resulting altered dataset. Here, we develop clustering methodology through a model-based approach using the marginal density for the observed values, assuming a finite mixture model of multivariate $t$ distributions. We compare our approximate algorithm to the corresponding full expectation-maximization (EM) approach that considers the missing values in the incomplete data set and makes a missing at random (MAR) assumption, as well as case deletion and imputation methods. Since only the observed values are utilized, our approach is computationally more efficient than imputation or full EM. Simulation studies demonstrate that our approach has favorable recovery of the true cluster partition compared to case deletion and imputation under various missingness mechanisms, and is at least competitive with the full EM approach, even when MAR assumptions are violated. Our methodology is demonstrated on a problem of clustering gamma-ray bursts and is implemented at https://github.com/emilygoren/MixtClust.
IVJan 27
Optimized $k$-means color quantization of digital images in machine-based and human perception-based colorspacesRanjan Maitra
Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The $k$-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of $k$-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that in about half of the cases, $k$-means color quantization is best in the RGB space, while at other times, and especially for higher quantization levels ($k$), the CIE-XYZ colorspace is where it usually does better. There are also some cases, especially at lower $k$, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performances in terms of the distributions of the hue, chromaticity and luminance in an image presents a nuanced perspective and characterization of the images for which each colorspace is better for $k$-means color quantization.
MENov 9, 2021
Exploratory Factor Analysis of Data on a SphereFan Dai, Karin S. Dorman, Somak Dutta et al.
Data on high-dimensional spheres arise frequently in many disciplines either naturally or as a consequence of preliminary processing and can have intricate dependence structure that needs to be understood. We develop exploratory factor analysis of the projected normal distribution to explain the variability in such data using a few easily interpreted latent factors. Our methodology provides maximum likelihood estimates through a novel fast alternating expectation profile conditional maximization algorithm. Results on simulation experiments on a wide range of settings are uniformly excellent. Our methodology provides interpretable and insightful results when applied to tweets with the $\#MeToo$ hashtag in early December 2018, to time-course functional Magnetic Resonance Images of the average pre-teen brain at rest, to characterize handwritten digits, and to gene expression data from cancerous cells in the Cancer Genome Atlas.
MEOct 19, 2021
Fully Three-dimensional Radial VisualizationYifan Zhu, Fan Dai, Ranjan Maitra
We develop methodology for three-dimensional (3D) radial visualization (RadViz) of multidimensional datasets. The classical two-dimensional (2D) RadViz visualizes multivariate data in the 2D plane by mapping every observation to a point inside the unit circle. Our tool, RadViz3D, distributes anchor points uniformly on the 3D unit sphere. We show that this uniform distribution provides the best visualization with minimal artificial visual correlation for data with uncorrelated variables. However, anchor points can be placed exactly equi-distant from each other only for the five Platonic solids, so we provide equi-distant anchor points for these five settings, and approximately equi-distant anchor points via a Fibonacci grid for the other cases. Our methodology, implemented in the R package $radviz3d$, makes fully 3D RadViz possible and is shown to improve the ability of this nonlinear technique in more faithfully displaying simulated data as well as the crabs, olive oils and wine datasets. Additionally, because radial visualization is naturally suited for compositional data, we use RadViz3D to illustrate (i) the chemical composition of Longquan celadon ceramics and their Jingdezhen imitation over centuries, and (ii) US regional SARS-Cov-2 variants' prevalence in the Covid-19 pandemic during the summer 2021 surge of the Delta variant.
IVSep 24, 2021
Quantitative Matching of Forensic Evidence Fragments Utilizing 3D Microscopy Analysis of Fracture Surface ReplicasBishoy Dawood, Carlos Llosa-Vite, Geoffrey Z. Thompson et al.
Fractured surfaces carry unique details that can provide an accurate quantitative comparison to support comparative forensic analysis of those fractured surfaces. In this study, a statistical analysis comparison protocol was applied to a set of 3D topological images of fractured surface pairs and their replicas to provide confidence in the quantitative statistical comparison between fractured items and their replicas. A set of 10 fractured stainless steel samples was fractured from the same metal rod under controlled conditions and were cast using a standard forensic casting technique. Six 3D topological maps with 50% overlap were acquired for each fractured pair. Spectral analysis was utilized to identify the correlation between topological surface features at different length scales of the surface topology. We selected two frequency bands over the critical wavelength (which is greater than two-grain diameters) for statistical comparison. Our statistical model utilized a matrix-variate-$t$ distribution that accounts for the image-overlap to model the match and non-match population densities. A decision rule was developed to identify the probability of matched and unmatched pairs of surfaces. The proposed methodology correctly classified the fractured steel surfaces and their replicas with a posterior probability of match exceeding 99.96%. Moreover, the replication technique shows the potential to accurately replicate fracture surface topological details with a wavelength greater than 20$μ$m, which far exceeds the range for comparison of most metallic alloys of 50-200$μ$m. The developed framework establishes the basis of forensic comparison of fractured articles and their replicas while providing a reliable quantitative statistical forensic comparison, utilizing fracture mechanics-based analysis of the fracture surface topology.
APFeb 6, 2021
A practical model-based segmentation approach for improved activation detection in single-subject functional Magnetic Resonance Imaging studiesWei-Chen Chen, Ranjan Maitra
Functional Magnetic Resonance Imaging (fMRI) maps cerebral activation in response to stimuli but this activation is often difficult to detect, especially in low-signal contexts and single-subject studies. Accurate activation detection can be guided by the fact that very few voxels are, in reality, truly activated and that these voxels are spatially localized, but it is challenging to incorporate both these facts. We address these twin challenges to single-subject and low-signal fMRI by developing a computationally feasible and methodologically sound model-based approach, implemented in the R package MixfMRI, that bounds the a priori expected proportion of activated voxels while also incorporating spatial context. An added benefit of our methodology is the ability to distinguish voxels and regions having different intensities of activation. Our suggested approach is evaluated in realistic two- and three-dimensional simulation experiments as well as on multiple real-world datasets. Finally, the value of our suggested approach in low-signal and single-subject fMRI studies is illustrated on a sports imagination experiment that is often used to detect awareness and improve treatment in patients in persistent vegetative state (PVS). Our ability to reliably distinguish activation in this experiment potentially opens the door to the adoption of fMRI as a clinical tool for the improved treatment and therapy of PVS survivors and other patients.
MEDec 18, 2020
Reduced-Rank Tensor-on-Tensor Regression and Tensor-variate Analysis of VarianceCarlos Llosa-Vite, Ranjan Maitra
Fitting regression models with many multivariate responses and covariates can be challenging, but such responses and covariates sometimes have tensor-variate structure. We extend the classical multivariate regression model to exploit such structure in two ways: first, we impose four types of low-rank tensor formats on the regression coefficients. Second, we model the errors using the tensor-variate normal distribution that imposes a Kronecker separable format on the covariance matrix. We obtain maximum likelihood estimators via block-relaxation algorithms and derive their computational complexity and asymptotic distributions. Our regression framework enables us to formulate tensor-variate analysis of variance (TANOVA) methodology. This methodology, when applied in a one-way TANOVA layout, enables us to identify cerebral regions significantly associated with the interaction of suicide attempters or non-attemptor ideators and positive-, negative- or death-connoting words in a functional Magnetic Resonance Imaging study. Another application uses three-way TANOVA on the Labeled Faces in the Wild image dataset to distinguish facial characteristics related to ethnic origin, age group and gender.
MEJun 6, 2020
An Efficient $k$-modes Algorithm for Clustering Categorical DatasetsKarin S. Dorman, Ranjan Maitra
Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes optimization.
CVApr 20, 2020
CatSIM: A Categorical Image Similarity MetricGeoffrey Z. Thompson, Ranjan Maitra
We introduce CatSIM, a new similarity metric for binary and multinary two- and three-dimensional images and volumes. CatSIM uses a structural similarity image quality paradigm and is robust to small perturbations in location so that structures in similar, but not entirely overlapping, regions of two images are rated higher than using simple matching. The metric can also compare arbitrary regions inside images. CatSIM is evaluated on artificial data sets, image quality assessment surveys and two imaging applications
MEMar 12, 2020
Estimating Basis Functions in Massive Fields under the Spatial Mixed Effects ModelKarl T. Pazdernik, Ranjan Maitra
Spatial prediction is commonly achieved under the assumption of a Gaussian random field (GRF) by obtaining maximum likelihood estimates of parameters, and then using the kriging equations to arrive at predicted values. For massive datasets, fixed rank kriging using the Expectation-Maximization (EM) algorithm for estimation has been proposed as an alternative to the usual but computationally prohibitive kriging method. The method reduces computation cost of estimation by redefining the spatial process as a linear combination of basis functions and spatial random effects. A disadvantage of this method is that it imposes constraints on the relationship between the observed locations and the knots. We develop an alternative method that utilizes the Spatial Mixed Effects (SME) model, but allows for additional flexibility by estimating the range of the spatial dependence between the observations and the knots via an Alternating Expectation Conditional Maximization (AECM) algorithm. Experiments show that our methodology improves estimation without sacrificing prediction accuracy while also minimizing the additional computational burden of extra parameter estimation. The methodology is applied to a temperature data set archived by the United States National Climate Data Center, with improved results over previous methodology.
GAMar 12, 2020
Multi-layered characterization of hot stellar systems with confidenceSouradeep Chattopadhyay, Steven D. Kawaler, Ranjan Maitra
Understanding the physical and evolutionary properties of Hot Stellar Systems (HSS) is a major challenge in astronomy. We studied the dataset on 13456 HSS of Misgeld and Hilker (2011) that includes 12763 candidate globular clusters using stellar mass ($M_s$), effective radius ($R_e$) and mass-to-luminosity ratio ($M_s/L_ν$), and found multi-layered homogeneous grouping among these stellar systems. Our methods elicited eight homogeneous ellipsoidal groups at the finest sub-group level. Some of these groups have high overlap and were merged through a multi-phased syncytial algorithm motivated from Almodóvar-Rivera and Maitra (2020). Five groups were merged in the first phase, resulting in three complex-structured groups. Our algorithm determined further complex structure and permitted another merging phase, revealing two complex-structured groups at the highest level. A nonparametric bootstrap procedure was also used to estimate the confidence of each of our group assignments. These assignments generally had high confidence in classification, indicating great degree of certainty of the HSS assignments into our complex-structured groups. The physical and kinematic properties of the two groups were assessed in terms of $M_s$, $R_e$, surface density and $M_s/L_ν$. The first group consisted of older, smaller and less bright HSS while the second group consisted of brighter and younger HSS. Our analysis provides novel insight into the physical and evolutionary properties of HSS and also helps understand physical and evolutionary properties of candidate globular clusters. Further, the candidate globular clusters (GCs) are seen to have very high chance of really being GCs rather than dwarfs or dwarf ellipticals that are also indicated to be quite distinct from each other.
MEJul 27, 2019
A Matrix--free Likelihood Method for Exploratory Factor Analysis of High-dimensional Gaussian DataFan Dai, Somak Dutta, Ranjan Maitra
This paper proposes a novel profile likelihood method for estimating the covariance parameters in exploratory factor analysis of high-dimensional Gaussian datasets with fewer observations than number of variables. An implicitly restarted Lanczos algorithm and a limited-memory quasi-Newton method are implemented to develop a matrix-free framework for likelihood maximization. Simulation results show that our method is substantially faster than the expectation-maximization solution without sacrificing accuracy. Our method is applied to fit factor models on data from suicide attempters, suicide ideators and a control group.
MEJul 22, 2019
Classification with the matrix-variate-$t$ distributionGeoffrey Z. Thompson, Ranjan Maitra, William Q. Meeker et al.
Matrix-variate distributions can intuitively model the dependence structure of matrix-valued observations that arise in applications with multivariate time series, spatio-temporal or repeated measures. This paper develops an Expectation-Maximization algorithm for discriminant analysis and classification with matrix-variate $t$-distributions. The methodology shows promise on simulated datasets or when applied to the forensic matching of fractured surfaces or the classification of functional Magnetic Resonance, satellite or hand gestures images.
MLApr 21, 2019
TiK-means: $K$-means clustering for skewed groupsNicholas S. Berry, Ranjan Maitra
The $K$-means algorithm is extended to allow for partitioning of skewed groups. Our algorithm is called TiK-Means and contributes a $K$-means type algorithm that assigns observations to groups while estimating their skewness-transformation parameters. The resulting groups and transformation reveal general-structured clusters that can be explained by inverting the estimated transformation. Further, a modification of the jump statistic chooses the number of groups. Our algorithm is evaluated on simulated and real-life datasets and then applied to a long-standing astronomical dispute regarding the distinct kinds of gamma ray bursts.
MLApr 6, 2019
Visualization of Labeled Mixed-featured DatasetsYifan Zhu, Fan Dai, Ranjan Maitra
We develop methodology for visualization of labeled mixed-featured datasets. We first investigate datasets with continuous features where our Max-Ratio Projection (MRP) method utilizes the group information in high dimensions to provide distinctive lower-dimensional projections that are then displayed using Radviz3D. Our methodology is extended to datasets with discrete and continuous features where a Gaussianized distributional transform is used in conjunction with copula models before applying MRP and visualizing the result using RadViz3D. A R package $radviz3d$ implementing our complete methodology is available.
MEMay 24, 2018
Kernel-estimated Nonparametric Overlap-Based Syncytial ClusteringIsrael Almodóvar-Rivera, Ranjan Maitra
Commonly-used clustering algorithms usually find ellipsoidal, spherical or other regular-structured clusters, but are more challenged when the underlying groups lack formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free fully-automated syncytial clustering algorithm that can be used with $k$-means and other algorithms. Our approach estimates the cumulative distribution function of the normed residuals from an appropriately fit $k$-groups model and calculates the estimated nonparametric overlap between each pair of clusters. Groups with high pairwise overlap are merged as long as the estimated generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets and can be applied to datasets with scatter or incomplete records. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study.
MLFeb 23, 2018
An efficient $k$-means-type algorithm for clustering datasets with incomplete recordsAndrew Lithio, Ranjan Maitra
The $k$-means algorithm is arguably the most popular nonparametric clustering method but cannot generally be applied to datasets with incomplete records. The usual practice then is to either impute missing values under an assumed missing-completely-at-random mechanism or to ignore the incomplete records, and apply the algorithm on the resulting dataset. We develop an efficient version of the $k$-means algorithm that allows for clustering in the presence of incomplete records. Our extension is called $k_m$-means and reduces to the $k$-means algorithm when all records are complete. We also provide initialization strategies for our algorithm and methods to estimate the number of groups in the dataset. Illustrations and simulations demonstrate the efficacy of our approach in a variety of settings and patterns of missing data. Our methods are also applied to the analysis of activation images obtained from a functional Magnetic Resonance Imaging experiment.
MLDec 23, 2017
Merging $K$-means with hierarchical clustering for identifying general-shaped groupsAnna D. Peterson, Arka P. Ghosh, Ranjan Maitra
Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and $K$-means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets while $K$-means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using $K$-means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.
MEDec 6, 2017
Approximations in the homogeneous Ising modelAlejandro Murua-Sazo, Ranjan Maitra
The Ising model is important in statistical modeling and inference in many applications, however its normalizing constant, mean number of active vertices and mean spin interaction -- quantities needed in inference -- are computationally intractable. We provide accurate approximations that make it possible to numerically calculate these quantities in the homogeneous case. Simulation studies indicate good performance of our approximation formulae that are scalable and unfazed by the size (number of nodes, degree of graph) of the Markov Random Field. The practical import of our approximation formulae is illustrated in performing Bayesian inference in a functional Magnetic Resonance Imaging activation detection experiment, and also in likelihood ratio testing for anisotropy in the spatial patterns of yearly increases in pistachio tree yields.