Shuchin Aeron

h-index86

62papers

2,333citations

Novelty50%

AI Score57

Ranked #15,764 of 205,806 authors (top 8%)#3,454 in LG (top 8%)

62 Papers

LGJan 29Code

The Ensemble Inverse Problem: Applications and Methods

Zhengyan Huan, Camila Pazos, Martin Klassen et al.

We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non-iterative inference-time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth-observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods.

LGNov 9, 2023Code

Hard-Negative Sampling for Contrastive Learning: Optimal Representation Geometry and Neural- vs Dimensional-Collapse

Ruijie Jiang, Thuan Nguyen, Shuchin Aeron et al.

For a widely-studied data model and general loss and sample-hardening functions we prove that the losses of Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) are minimized by representations that exhibit Neural-Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) losses are lower bounded by the corresponding SCL and UCL losses. In contrast to existing literature, our theoretical results for SCL do not require class-conditional independence of augmented views and work for a general loss function class that includes the widely used InfoNCE loss function. Moreover, our proofs are simpler, compact, and transparent. Similar to existing literature, our theoretical claims also hold for the practical scenario where batching is used for optimization. We empirically demonstrate, for the first time, that Adam optimization (with batching) of HSCL and HUCL losses with random initialization and suitable hardness levels can indeed converge to the NC-geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard-negatives or feature normalization, however, the representations learned via Adam suffer from Dimensional-Collapse (DC) and fail to attain the NC-geometry. These results exemplify the role of hard-negative sampling in contrastive representation learning and we conclude with several open theoretical problems for future work. The code can be found at https://github.com/rjiang03/HCL/tree/main

LGAug 31, 2022

Supervised Contrastive Learning with Hard Negative Samples

Ruijie Jiang, Thuan Nguyen, Prakash Ishwar et al.

Through minimization of an appropriate loss function such as the InfoNCE loss, contrastive learning (CL) learns a useful representation function by pulling positive samples close to each other while pushing negative samples far apart in the embedding space. The positive samples are typically created using "label-preserving" augmentations, i.e., domain-specific transformations of a given datum or anchor. In absence of class information, in unsupervised CL (UCL), the negative samples are typically chosen randomly and independently of the anchor from a preset negative sampling distribution over the entire dataset. This leads to class-collisions in UCL. Supervised CL (SCL), avoids this class collision by conditioning the negative sampling distribution to samples having labels different from that of the anchor. In hard-UCL (H-UCL), which has been shown to be an effective method to further enhance UCL, the negative sampling distribution is conditionally tilted, by means of a hardening function, towards samples that are closer to the anchor. Motivated by this, in this paper we propose hard-SCL (H-SCL) {wherein} the class conditional negative sampling distribution {is tilted} via a hardening function. Our simulation results confirm the utility of H-SCL over SCL with significant performance gains {in downstream classification tasks.} Analytically, we show that {in the} limit of infinite negative samples per anchor and a suitable assumption, the {H-SCL loss} is upper bounded by the {H-UCL loss}, thereby justifying the utility of H-UCL {for controlling} the H-SCL loss in the absence of label information. Through experiments on several datasets, we verify the assumption as well as the claimed inequality between H-UCL and H-SCL losses. We also provide a plausible scenario where H-SCL loss is lower bounded by UCL loss, indicating the limited utility of UCL in controlling the H-SCL loss.

LGAug 1, 2022

Joint covariate-alignment and concept-alignment: a framework for domain generalization

Thuan Nguyen, Boyang Lyu, Prakash Ishwar et al.

In this paper, we propose a novel domain generalization (DG) framework based on a new upper bound to the risk on the unseen domain. Particularly, our framework proposes to jointly minimize both the covariate-shift as well as the concept-shift between the seen domains for a better performance on the unseen domain. While the proposed approach can be implemented via an arbitrary combination of covariate-alignment and concept-alignment modules, in this work we use well-established approaches for distributional alignment namely, Maximum Mean Discrepancy (MMD) and covariance Alignment (CORAL), and use an Invariant Risk Minimization (IRM)-based approach for concept alignment. Our numerical results show that the proposed methods perform as well as or better than the state-of-the-art for domain generalization on several data sets.

MLMay 31, 2022

Easy Variational Inference for Categorical Models via an Independent Binary Approximation

Michael T. Wojnowicz, Shuchin Aeron, Eric L. Miller et al.

We pursue tractable Bayesian analysis of generalized linear models (GLMs) for categorical data. Thus far, GLMs are difficult to scale to more than a few dozen categories due to non-conjugacy or strong posterior dependencies when using conjugate auxiliary variable methods. We define a new class of GLMs for categorical data called categorical-from-binary (CB) models. Each CB model has a likelihood that is bounded by the product of binary likelihoods, suggesting a natural posterior approximation. This approximation makes inference straightforward and fast; using well-known auxiliary variables for probit or logistic regression, the product of binary models admits conjugate closed-form variational inference that is embarrassingly parallel across categories and invariant to category ordering. Moreover, an independent binary model simultaneously approximates multiple CB models. Bayesian model averaging over these can improve the quality of the approximation for any given dataset. We show that our approach scales to thousands of categories, outperforming posterior estimation competitors like Automatic Differentiation Variational Inference (ADVI) and No U-Turn Sampling (NUTS) in the time required to achieve fixed prediction quality.

LGApr 2, 2023

A principled approach to model validation in domain generalization

Boyang Lyu, Thuan Nguyen, Matthias Scheutz et al.

Domain generalization aims to learn a model with good generalization ability, that is, the learned model should not only perform well on several seen domains but also on unseen domains with different data distributions. State-of-the-art domain generalization methods typically train a representation function followed by a classifier jointly to minimize both the classification risk and the domain discrepancy. However, when it comes to model selection, most of these methods rely on traditional validation routines that select models solely based on the lowest classification risk on the validation set. In this paper, we theoretically demonstrate a trade-off between minimizing classification risk and mitigating domain discrepancy, i.e., it is impossible to achieve the minimum of these two objectives simultaneously. Motivated by this theoretical result, we propose a novel model selection method suggesting that the validation process should account for both the classification risk and the domain discrepancy. We validate the effectiveness of the proposed method by numerical results on several domain generalization datasets.

LGOct 26, 2022

Trade-off between reconstruction loss and feature alignment for domain generalization

Thuan Nguyen, Boyang Lyu, Prakash Ishwar et al.

Domain generalization (DG) is a branch of transfer learning that aims to train the learning models on several seen domains and subsequently apply these pre-trained models to other unseen (unknown but related) domains. To deal with challenging settings in DG where both data and label of the unseen domain are not available at training time, the most common approach is to design the classifiers based on the domain-invariant representation features, i.e., the latent representations that are unchanged and transferable between domains. Contrary to popular belief, we show that designing classifiers based on invariant representation features alone is necessary but insufficient in DG. Our analysis indicates the necessity of imposing a constraint on the reconstruction loss induced by representation functions to preserve most of the relevant information about the label in the latent space. More importantly, we point out the trade-off between minimizing the reconstruction loss and achieving domain alignment in DG. Our theoretical results motivate a new DG framework that jointly optimizes the reconstruction loss and the domain discrepancy. Both theoretical and numerical results are provided to justify our approach.

MLFeb 15, 2023

On Rank Energy Statistics via Optimal Transport: Continuity, Convergence, and Change Point Detection

Matthew Werenski, Shoaib Bin Masud, James M. Murphy et al.

This paper considers the use of recently proposed optimal transport-based multivariate test statistics, namely rank energy and its variant the soft rank energy derived from entropically regularized optimal transport, for the unsupervised nonparametric change point detection (CPD) problem. We show that the soft rank energy enjoys both fast rates of statistical convergence and robust continuity properties which lead to strong performance on real datasets. Our theoretical analyses remove the need for resampling and out-of-sample extensions previously required to obtain such rates. In contrast the rank energy suffers from the curse of dimensionality in statistical estimation and moreover can signal a change point from arbitrarily small perturbations, which leads to a high rate of false alarms in CPD. Additionally, under mild regularity conditions, we quantify the discrepancy between soft rank energy and rank energy in terms of the regularization parameter. Finally, we show our approach performs favorably in numerical experiments compared to several other optimal transport-based methods as well as maximum mean discrepancy.

SPNov 14, 2022

Alternating minimization algorithm with initialization analysis for r-local and k-sparse unlabeled sensing

Ahmed Abbasi, Shuchin Aeron, Abiy Tasissa

Unlabeled sensing is a linear inverse problem with permuted measurements. We propose an alternating minimization (AltMin) algorithm with a suitable initialization for two widely considered permutation models: partially shuffled/$k$-sparse permutations and $r$-local/block diagonal permutations. Key to the performance of the AltMin algorithm is the initialization. For the exact unlabeled sensing problem, assuming either a Gaussian measurement matrix or a sub-Gaussian signal, we bound the initialization error in terms of the number of blocks $s$ and the number of shuffles $k$. Experimental results show that our algorithm is fast, applicable to both permutation models, and robust to choice of measurement matrix. We also test our algorithm on several real datasets for the linked linear regression problem and show superior performance compared to baseline methods.

LGOct 4, 2022

Nonparametric and Regularized Dynamical Wasserstein Barycenters for Sequential Observations

Kevin C. Cheng, Shuchin Aeron, Michael C. Hughes et al.

We consider probabilistic models for sequential observations which exhibit gradual transitions among a finite number of states. We are particularly motivated by applications such as human activity analysis where observed accelerometer time series contains segments representing distinct activities, which we call pure states, as well as periods characterized by continuous transition among these pure states. To capture this transitory behavior, the dynamical Wasserstein barycenter (DWB) model of Cheng et al. in 2021 [1] associates with each pure state a data-generating distribution and models the continuous transitions among these states as a Wasserstein barycenter of these distributions with dynamically evolving weights. Focusing on the univariate case where Wasserstein distances and barycenters can be computed in closed form, we extend [1] specifically relaxing the parameterization of the pure states as Gaussian distributions. We highlight issues related to the uniqueness in identifying the model parameters as well as uncertainties induced when estimating a dynamically evolving distribution from a limited number of samples. To ameliorate non-uniqueness, we introduce regularization that imposes temporal smoothness on the dynamics of the barycentric weights. A quantile-based approximation of the pure state distributions yields a finite dimensional estimation problem which we numerically solve using cyclic descent alternating between updates to the pure-state quantile functions and the barycentric weights. We demonstrate the utility of the proposed algorithm in segmenting both simulated and real world human activity time series.

CVJul 18, 2023

Systematic comparison of semi-supervised and self-supervised learning for medical image classification

Zhe Huang, Ruijie Jiang, Shuchin Aeron et al.

In typical medical image classification problems, labeled data is scarce while unlabeled data is more available. Semi-supervised learning and self-supervised learning are two different research directions that can improve accuracy by learning from extra unlabeled data. Recent methods from both directions have reported significant gains on traditional benchmarks. Yet past benchmarks do not focus on medical tasks and rarely compare self- and semi- methods together on an equal footing. Furthermore, past benchmarks often handle hyperparameter tuning suboptimally. First, they may not tune hyperparameters at all, leading to underfitting. Second, when tuning does occur, it often unrealistically uses a labeled validation set that is much larger than the training set. Therefore currently published rankings might not always corroborate with their practical utility This study contributes a systematic evaluation of self- and semi- methods with a unified experimental protocol intended to guide a practitioner with scarce overall labeled data and a limited compute budget. We answer two key questions: Can hyperparameter tuning be effective with realistic-sized validation sets? If so, when all methods are tuned well, which self- or semi-supervised methods achieve the best accuracy? Our study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets. From 20000+ GPU hours of computation, we provide valuable best practices to resource-constrained practitioners: hyperparameter tuning is effective, and the semi-supervised method known as MixMatch delivers the most reliable gains across 4 datasets.

LGDec 1, 2025

Generative Modeling with Continuous Flows: Sample Complexity of Flow Matching

Mudit Gaur, Prashant Trivedi, Shuchin Aeron et al.

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, offering faster sampling and simpler training by learning continuous flows governed by ordinary differential equations. Despite growing empirical success, the theoretical understanding of flow matching remains limited, particularly in terms of sample complexity results. In this work, we provide the first analysis of the sample complexity for flow-matching based generative models without assuming access to the empirical risk minimizer (ERM) of the loss function for estimating the velocity field. Under standard assumptions on the loss function for velocity field estimation and boundedness of the data distribution, we show that a sufficiently expressive neural network can learn a velocity field such that with $\mathcal{O}(ε^{-4})$ samples, such that the Wasserstein-2 distance between the learned and the true distribution is less than $\mathcal{O}(ε)$. The key technical idea is to decompose the velocity field estimation error into neural-network approximation error, statistical error due to the finite sample size, and optimization error due to the finite number of optimization steps for estimating the velocity field. Each of these terms are then handled via techniques that may be of independent interest.

LGOct 21, 2022

Geometric Sparse Coding in Wasserstein Space

Marshall Mueller, Shuchin Aeron, James M. Murphy et al.

Wasserstein dictionary learning is an unsupervised approach to learning a collection of probability distributions that generate observed distributions as Wasserstein barycentric combinations. Existing methods for Wasserstein dictionary learning optimize an objective that seeks a dictionary with sufficient representation capacity via barycentric interpolation to approximate the observed training data, but without imposing additional structural properties on the coefficients associated to the dictionary. This leads to dictionaries that densely represent the observed data, which makes interpretation of the coefficients challenging and may also lead to poor empirical performance when using the learned coefficients in downstream tasks. In contrast and motivated by sparse dictionary learning in Euclidean spaces, we propose a geometrically sparse regularizer for Wasserstein space that promotes representations of a data point using only nearby dictionary elements. We show this approach leads to sparse representations in Wasserstein space and addresses the problem of non-uniqueness of barycentric representation. Moreover, when data is generated as Wasserstein barycenters of fixed distributions, this regularizer facilitates the recovery of the generating distributions in cases that are ill-posed for unregularized Wasserstein dictionary learning. Through experimentation on synthetic and real data, we show that our geometrically regularized approach yields sparser and more interpretable dictionaries in Wasserstein space, which perform better in downstream applications.

MLNov 20, 2023

Estimation of entropy-regularized optimal transport maps between non-compactly supported measures

Matthew Werenski, James M. Murphy, Shuchin Aeron

This paper addresses the problem of estimating entropy-regularized optimal transport (EOT) maps with squared-Euclidean cost between source and target measures that are subGaussian. In the case that the target measure is compactly supported or strongly log-concave, we show that for a recently proposed in-sample estimator, the expected squared $L^2$-error decays at least as fast as $O(n^{-1/3})$ where $n$ is the sample size. For the general subGaussian case we show that the expected $L^1$-error decays at least as fast as $O(n^{-1/6})$, and in both cases we have polynomial dependence on the regularization parameter. While these results are suboptimal compared to known results in the case of compactness of both the source and target measures (squared $L^2$-error converging at a rate $O(n^{-1})$) and for when the source is subGaussian while the target is compactly supported (squared $L^2$-error converging at a rate $O(n^{-1/2})$), their importance lie in eliminating the compact support requirements. The proof technique makes use of a bias-variance decomposition where the variance is controlled using standard concentration of measure results and the bias is handled by T1-transport inequalities along with sample complexity results in estimation of EOT cost under subGaussian assumptions. Our experimental results point to a looseness in controlling the variance terms and we conclude by posing several open problems.

LGAug 18, 2025Code

Efficient Constraint-Aware Flow Matching via Randomized Exploration

Zhengyan Huan, Jacob Boerma, Li-Ping Liu et al.

We consider the problem of generating samples via Flow Matching (FM) with an additional requirement that the generated samples must satisfy given constraints. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at https://github.com/ZhengyanHuan/FM-RE.

48.9LGMay 11

Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

Thuan Nguyen, Shuchin Aeron, D. Richard Brown et al.

In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well-known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.

MLJan 13, 2025

Synthesis and Analysis of Data as Probability Measures with Entropy-Regularized Optimal Transport

Brendan Mallery, James M. Murphy, Shuchin Aeron

We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of reference measures given a set of coefficients belonging to the simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of barycenters with respect to the entropy-regularized Wasserstein-2 cost as solutions that correspond to a fixed point of an average of the entropy-regularized displacement maps. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when the measure being analyzed is a barycenter with respect to the entropy-regularized Wasserstein-2 cost. We show that these coefficients, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, and that barycentric coefficients are stable with respect to perturbations in the Wasserstein-2 metric. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.

CLNov 25, 2025

From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding

Farjana Sultana Mim, Shuchin Aeron, Eric Miller et al.

Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results indicate that these state-of-the-art models perform suboptimally on this task, indicating the potential for future research.

MLOct 31, 2024

Linearized Wasserstein Barycenters: Synthesis, Analysis, Representational Capacity, and Applications

Matthew Werenski, Brendan Mallery, Shuchin Aeron et al.

We propose the linear barycentric coding model (LBCM) which utilizes the linear optimal transport (LOT) metric for analysis and synthesis of probability measures. We provide a closed-form solution to the variational problem characterizing the probability measures in the LBCM and establish equivalence of the LBCM to the set of 2-Wasserstein barycenters in the special case of compatible measures. Computational methods for synthesizing and analyzing measures in the LBCM are developed with finite sample guarantees. One of our main theoretical contributions is to identify an LBCM, expressed in terms of a simple family, which is sufficient to express all probability measures on the closed unit interval. We show that a natural analogous construction of an LBCM in 2 dimensions fails, and we leave it as an open problem to identify the proper extension in more than 1 dimension. We conclude by demonstrating the utility of LBCM for covariance estimation and data imputation.

MLJan 28, 2022

Measure Estimation in the Barycentric Coding Model

Matthew Werenski, Ruijie Jiang, Abiy Tasissa et al.

This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.

LGJan 25, 2022

Conditional entropy minimization principle for learning domain invariant representation features

Thuan Nguyen, Boyang Lyu, Prakash Ishwar et al.

Invariance-principle-based methods such as Invariant Risk Minimization (IRM), have recently emerged as promising approaches for Domain Generalization (DG). Despite promising theory, such approaches fail in common classification tasks due to the mixing of true invariant features and spurious invariant features. To address this, we propose a framework based on the conditional entropy minimization (CEM) principle to filter-out the spurious invariant features leading to a new algorithm with a better generalization capability. We show that our proposed approach is closely related to the well-known Information Bottleneck (IB) framework and prove that under certain assumptions, entropy minimization can exactly recover the true invariant features. Our approach provides competitive classification accuracy compared to recent theoretically-principled state-of-the-art alternatives across several DG datasets.

LGNov 4, 2021

Hard Negative Sampling via Regularized Optimal Transport for Contrastive Representation Learning

Ruijie Jiang, Prakash Ishwar, Shuchin Aeron

We study the problem of designing hard negative sampling distributions for unsupervised contrastive representation learning. We propose and analyze a novel min-max framework that seeks a representation which minimizes the maximum (worst-case) generalized contrastive learning loss over all couplings (joint distributions between positive and negative samples subject to marginal constraints) and prove that the resulting min-max optimum representation will be degenerate. This provides the first theoretical justification for incorporating additional regularization constraints on the couplings. We re-interpret the min-max problem through the lens of Optimal Transport (OT) theory and utilize regularized transport couplings to control the degree of hardness of negative examples. Through experiments we demonstrate that the negative samples generated from our designed negative distribution are more similar to the anchor than those generated from the baseline negative distribution. We also demonstrate that entropic regularization yields negative sampling distributions with parametric form similar to that in a recent state-of-the-art negative sampling design and has similar performance in multiple datasets. Utilizing the uncovered connection with OT, we propose a new ground cost for designing the negative distribution and show improved performance of the learned representation on downstream tasks compared to the representation learned when using squared Euclidean cost.

CLNov 1, 2021

Interpretable contrastive word mover's embedding

Ruijie Jiang, Julia Gouvea, Eric Miller et al.

This paper shows that a popular approach to the supervised embedding of documents for classification, namely, contrastive Word Mover's Embedding, can be significantly enhanced by adding interpretability. This interpretability is achieved by incorporating a clustering promoting mechanism into the contrastive loss. On several public datasets, we show that our method improves significantly upon existing baselines while providing interpretation to the clusters via identifying a set of keywords that are the most representative of a particular class. Our approach was motivated in part by the need to develop Natural Language Processing (NLP) methods for the \textit{novel problem of assessing student work for scientific writing and thinking} - a problem that is central to the area of (educational) Learning Sciences (LS). In this context, we show that our approach leads to a meaningful assessment of the student work related to lab reports from a biology class and can help LS researchers gain insights into student understanding and assess evidence of scientific thought processes.

MLOct 29, 2021

Robust and efficient change point detection using novel multivariate rank-energy GoF test

Shoaib Bin Masud, Shuchin Aeron

In this paper, we use and further develop upon a recently proposed multivariate, distribution-free Goodness-of-Fit (GoF) test based on the theory of Optimal Transport (OT) called the Rank Energy (RE) [1], for non-parametric and unsupervised Change Point Detection (CPD) in multivariate time series data. We show that directly using RE leads to high sensitivity to very small changes in distributions (causing high false alarms) and it requires large sample complexity and huge computational cost. To alleviate these drawbacks, we propose a new GoF test statistic called as soft-Rank Energy (sRE) that is based on entropy regularized OT and employ it towards CPD. We discuss the advantages of using sRE over RE and demonstrate that the proposed sRE based CPD outperforms all the existing methods in terms of Area Under the Curve (AUC) and F1-score on real and synthetic data sets.

MLOct 29, 2021

Multivariate rank via entropic optimal transport: sample efficiency and generative modeling

Shoaib Bin Masud, Matthew Werenski, James M. Murphy et al.

The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper, we point to and alleviate some of the practical shortcomings of these proposed GoF statistics, namely their high computational cost, high statistical sample complexity, and lack of differentiability with respect to the data. We show that all these practically important issues are addressed by considering entropy-regularized optimal transport maps in place of the rank map, which we refer to as the soft rank. We consequently propose two new statistics, the soft rank energy (sRE) and soft rank maximum mean discrepancy (sRMMD), which exhibit several desirable properties. Given $n$ sample data points, we provide non-asymptotic convergence rates for the sample estimate of the entropic transport map to its population version that are essentially of the order $n^{-1/2}$ when the starting measure is subgaussian and the target measure has compact support. This result is novel compared to existing results which achieve a rate of $n^{-1}$ but crucially rely on both measures having compact support. We leverage this result to demonstrate fast convergence of sample sRE and sRMMD to their population version making them useful for high-dimensional GoF testing. Our statistics are differentiable and amenable to popular machine learning frameworks that rely on gradient methods. We leverage these properties towards showcasing the utility of the proposed statistics for generative modeling on two important problems: image generation and generating valid knockoffs for controlled feature selection.

LGOct 13, 2021

Dynamical Wasserstein Barycenters for Time-series Modeling

Kevin C. Cheng, Shuchin Aeron, Michael C. Hughes et al.

Many time series can be modeled as a sequence of segments representing high-level discrete states, such as running and walking in a human activity application. Flexible models should describe the system state and observations in stationary "pure-state" periods as well as transition periods between adjacent segments, such as a gradual slowdown between running and walking. However, most prior work assumes instantaneous transitions between pure discrete states. We propose a dynamical Wasserstein barycentric (DWB) model that estimates the system state over time as well as the data-generating distributions of pure states in an unsupervised manner. Our model assumes each pure state generates data from a multivariate normal distribution, and characterizes transitions between states via displacement-interpolation specified by the Wasserstein barycenter. The system state is represented by a barycentric weight vector which evolves over time via a random walk on the simplex. Parameter learning leverages the natural Riemannian geometry of Gaussian distributions under the Wasserstein distance, which leads to improved convergence speeds. Experiments on several human activity datasets show that our proposed DWB model accurately learns the generating distribution of pure states while improving state estimation for transition periods compared to the commonly used linear interpolation mixture models.

LGSep 4, 2021

Barycentric-alignment and reconstruction loss minimization for domain generalization

Boyang Lyu, Thuan Nguyen, Prakash Ishwar et al.

This paper advances the theory and practice of Domain Generalization (DG) in machine learning. We consider the typical DG setting where the hypothesis is composed of a representation mapping followed by a labeling function. Within this setting, the majority of popular DG methods aim to jointly learn the representation and the labeling functions by minimizing a well-known upper bound for the classification risk in the unseen domain. In practice, however, methods based on this theoretical upper bound ignore a term that cannot be directly optimized due to its dual dependence on both the representation mapping and the unknown optimal labeling function in the unseen domain. To bridge this gap between theory and practice, we introduce a new upper bound that is free of terms having such dual dependence, resulting in a fully optimizable risk upper bound for the unseen domain. Our derivation leverages classical and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Compared to previous bounds, our bound introduces two new terms: (i) the Wasserstein-2 barycenter term that aligns distributions between domains, and (ii) the reconstruction loss term that assesses the quality of representation in reconstructing the original data. Based on this new upper bound, we propose a novel DG algorithm named Wasserstein Barycenter Auto-Encoder (WBAE) that simultaneously minimizes the classification loss, the barycenter loss, and the reconstruction loss. Numerical results demonstrate that the proposed method outperforms current state-of-the-art DG algorithms on several datasets.

MLMar 16, 2021

Soft and subspace robust multivariate rank tests based on entropy regularized optimal transport

Shoaib Bin Masud, Boyang Lyu, Shuchin Aeron

In this paper, we extend the recently proposed multivariate rank energy distance, based on the theory of optimal transport, for statistical testing of distributional similarity, to soft rank energy distance. Being differentiable, this in turn allows us to extend the rank energy to a subspace robust rank energy distance, dubbed Projected soft-Rank Energy distance, which can be computed via optimization over the Stiefel manifold. We show via experiments that using projected soft rank energy one can trade-off the detection power vs the false alarm via projections onto an appropriately selected low dimensional subspace. We also show the utility of the proposed tests on unsupervised change point detection in multivariate time series data. All codes are publicly available at the link provided in the experiment section.

ITMar 12, 2021

Multiview Sensing With Unknown Permutations: An Optimal Transport Approach

Yanting Ma, Petros T. Boufounos, Hassan Mansour et al.

In several applications, including imaging of deformable objects while in motion, simultaneous localization and mapping, and unlabeled sensing, we encounter the problem of recovering a signal that is measured subject to unknown permutations. In this paper we take a fresh look at this problem through the lens of optimal transport (OT). In particular, we recognize that in most practical applications the unknown permutations are not arbitrary but some are more likely to occur than others. We exploit this by introducing a regularization function that promotes the more likely permutations in the solution. We show that, even though the general problem is not convex, an appropriate relaxation of the resulting regularized problem allows us to exploit the well-developed machinery of OT and develop a tractable algorithm.

LGNov 26, 2020

Automatic coding of students' writing via Contrastive Representation Learning in the Wasserstein space

Ruijie Jiang, Julia Gouvea, David Hammer et al.

Qualitative analysis of verbal data is of central importance in the learning sciences. It is labor-intensive and time-consuming, however, which limits the amount of data researchers can include in studies. This work is a step towards building a statistical machine learning (ML) method for achieving an automated support for qualitative analyses of students' writing, here specifically in score laboratory reports in introductory biology for sophistication of argumentation and reasoning. We start with a set of lab reports from an undergraduate biology course, scored by a four-level scheme that considers the complexity of argument structure, the scope of evidence, and the care and nuance of conclusions. Using this set of labeled data, we show that a popular natural language modeling processing pipeline, namely vector representation of words, a.k.a word embeddings, followed by Long Short Term Memory (LSTM) model for capturing language generation as a state-space model, is able to quantitatively capture the scoring, with a high Quadratic Weighted Kappa (QWK) prediction score, when trained in via a novel contrastive learning set-up. We show that the ML algorithm approached the inter-rater reliability of human analysis. Ultimately, we conclude, that machine learning (ML) for natural language processing (NLP) holds promise for assisting learning sciences researchers in conducting qualitative studies at much larger scales than is currently possible.

LGJul 22, 2020

Robust Machine Learning via Privacy/Rate-Distortion Theory

Ye Wang, Shuchin Aeron, Adnan Siraj Rakin et al.

Robust machine learning formulations have emerged to address the prevalent vulnerability of deep neural networks to adversarial examples. Our work draws the connection between optimal robust learning and the privacy-utility tradeoff problem, which is a generalization of the rate-distortion problem. The saddle point of the game between a robust classifier and an adversarial perturbation can be found via the solution of a maximum conditional entropy problem. This information-theoretic perspective sheds light on the fundamental tradeoff between robustness and clean data performance, which ultimately arises from the geometric structure of the underlying data distribution and perturbation constraints.

LGJul 11, 2020

Representation Learning via Adversarially-Contrastive Optimal Transport

Anoop Cherian, Shuchin Aeron

In this paper, we study the problem of learning compact (low-dimensional) representations for sequential data that captures its implicit spatio-temporal cues. To maximize extraction of such informative cues from the data, we set the problem within the context of contrastive representation learning and to that end propose a novel objective via optimal transport. Specifically, our formulation seeks a low-dimensional subspace representation of the data that jointly (i) maximizes the distance of the data (embedded in this subspace) from an adversarial data distribution under the optimal transport, a.k.a. the Wasserstein distance, (ii) captures the temporal order, and (iii) minimizes the data distortion. To generate the adversarial distribution, we propose a novel framework connecting Wasserstein GANs with a classifier, allowing a principled mechanism for producing good negative distributions for contrastive learning, which is currently a challenging problem. Our full objective is cast as a subspace learning problem on the Grassmann manifold and solved via Riemannian optimization. To empirically study our formulation, we provide experiments on the task of human action recognition in video sequences. Our results demonstrate competitive performance against challenging baselines.

CVJul 2, 2020

Domain Adaptation for Robust Workload Level Alignment Between Sessions and Subjects using fNIRS

Boyang Lyu, Thao Pham, Giles Blaney et al.

Significance: We demonstrated the potential of using domain adaptation on functional Near-Infrared Spectroscopy (fNIRS) data to classify different levels of n-back tasks that involve working memory. Aim: Domain shift in fNIRS data is a challenge in the workload level alignment across different experiment sessions and subjects. In order to address this problem, two domain adaptation approaches -- Gromov-Wasserstein (G-W) and Fused Gromov-Wasserstein (FG-W) were used. Approach: Specifically, we used labeled data from one session or one subject to classify trials in another session (within the same subject) or another subject. We applied G-W for session-by-session alignment and FG-W for subject-by-subject alignment to fNIRS data acquired during different n-back task levels. We compared these approaches with three supervised methods: multi-class Support Vector Machine (SVM), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN). Results: In a sample of six subjects, G-W resulted in an alignment accuracy of 68 $\pm$ 4 % (weighted mean $\pm$ standard error) for session-by-session alignment, FG-W resulted in an alignment accuracy of 55 $\pm$ 2 % for subject-by-subject alignment. In each of these cases, 25 % accuracy represents chance. Alignment accuracy results from both G-W and FG-W are significantly greater than those from SVM, CNN and RNN. We also showed that removal of motion artifacts from the fNIRS data plays an important role in improving alignment performance. Conclusions: Domain adaptation has potential for session-by-session and subject-by-subject alignment of mental workload by using fNIRS data.

SPJun 9, 2020

On Matched Filtering for Statistical Change Point Detection

Kevin C. Cheng, Eric L. Miller, Michael C. Hughes et al.

Non-parametric and distribution-free two-sample tests have been the foundation of many change point detection algorithms. However, randomness in the test statistic as a function of time makes them susceptible to false positives and localization ambiguity. We address these issues by deriving and applying filters matched to the expected temporal signatures of a change for various sliding window, two-sample tests under IID assumptions on the data. These filters are derived asymptotically with respect to the window size for the Wasserstein quantile test, the Wasserstein-1 distance test, Maximum Mean Discrepancy squared (MMD^2), and the Kolmogorov-Smirnov (KS) test. The matched filters are shown to have two important properties. First, they are distribution-free, and thus can be applied without prior knowledge of the underlying data distributions. Second, they are peak-preserving, which allows the filtered signal produced by our methods to maintain expected statistical significance. Through experiments on synthetic data as well as activity recognition benchmarks, we demonstrate the utility of this approach for mitigating false positives and improving the test precision. Our method allows for the localization of change points without the use of ad-hoc post-processing to remove redundant detections common to current methods. We further highlight the performance of statistical tests based on the Quantile-Quantile (Q-Q) function and show how the invariance property of the Q-Q function to order-preserving transformations allows these tests to detect change points of different scales with a single threshold within the same dataset.

SPNov 14, 2019

R-local unlabeled sensing: A novel graph matching approach for multiview unlabeled sensing under local permutations

Ahmed Abbasi, Abiy Tasissa, Shuchin Aeron

Unlabeled sensing is a linear inverse problem where the measurements are scrambled under an unknown permutation leading to loss of correspondence between the measurements and the rows of the sensing matrix. Motivated by practical tasks such as mobile sensor networks, target tracking and the pose and correspondence estimation between point clouds, we study a special case of this problem restricting the class of permutations to be local and allowing for multiple views. In this setting, namely unlabeled multi-view sensing with local permutation, previous results and algorithms are not directly applicable. In this paper, we propose a computationally efficient algorithm that creatively exploits the machinery of graph alignment and Gromov-Wasserstein alignment and leverages the multiple views to estimate the local permutations. Simulation results on synthetic data sets indicate that the proposed algorithm is scalable and applicable to the challenging regimes of low to moderate SNR.

SPNov 4, 2019

Optimal Transport Based Change Point Detection and Time Series Segment Clustering

Kevin C. Cheng, Shuchin Aeron, Michael C. Hughes et al.

Two common problems in time series analysis are the decomposition of the data stream into disjoint segments that are each in some sense "homogeneous" - a problem known as Change Point Detection (CPD) - and the grouping of similar nonadjacent segments, a problem that we call Time Series Segment Clustering (TSSC). Building upon recent theoretical advances characterizing the limiting distribution-free behavior of the Wasserstein two-sample test (Ramdas et al. 2015), we propose a novel algorithm for unsupervised, distribution-free CPD which is amenable to both offline and online settings. We also introduce a method to mitigate false positives in CPD and address TSSC by using the Wasserstein distance between the detected segments to build an affinity matrix to which we apply spectral clustering. Results on both synthetic and real data sets show the benefits of the approach.

OCAug 2, 2019

On the modes of convergence of Stochastic Optimistic Mirror Descent (OMD) for saddle point problems

Yanting Ma, Shuchin Aeron, Hassan Mansour

In this article, we study the convergence of Mirror Descent (MD) and Optimistic Mirror Descent (OMD) for saddle point problems satisfying the notion of coherence as proposed in Mertikopoulos et al. We prove convergence of OMD with exact gradients for coherent saddle point problems, and show that monotone convergence only occurs after some sufficiently large number of iterations. This is in contrast to the claim in Mertikopoulos et al. of monotone convergence of OMD with exact gradients for coherent saddle point problems. Besides highlighting this important subtlety, we note that the almost sure convergence guarantees of MD and OMD with stochastic gradients for strictly coherent saddle point problems that are claimed in Mertikopoulos et al. are not fully justified by their proof. As such, we fill out the missing details in the proof and as a result have only been able to prove convergence with high probability. We would like to note that our analysis relies heavily on the core ideas and proof techniques introduced in Zhou et al. and Mertikopoulos et al., and we only aim to re-state and correct the results in light of what we were able to prove rigorously while filling in the much needed missing details in their proofs.

MLAug 20, 2018

Multi-View Graph Embedding Using Randomized Shortest Paths

Anuththari Gamage, Brian Rappaport, Shuchin Aeron et al.

Real-world data sets often provide multiple types of information about the same set of entities. This data is well represented by multi-view graphs, which consist of several distinct sets of edges over the same nodes. These can be used to analyze how entities interact from different viewpoints. Combining multiple views improves the quality of inferences drawn from the underlying data, which has increased interest in developing efficient multi-view graph embedding methods. We propose an algorithm, C-RSP, that generates a common (C) embedding of a multi-view graph using Randomized Shortest Paths (RSP). This algorithm generates a dissimilarity measure between nodes by minimizing the expected cost of a random walk between any two nodes across all views of a multi-view graph, in doing so encoding both the local and global structure of the graph. We test C-RSP on both real and synthetic data and show that it outperforms benchmark algorithms at embedding and clustering tasks while remaining computationally efficient.

LGMar 13, 2018

Principal Component Analysis with Tensor Train Subspace

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

Tensor train is a hierarchical tensor network structure that helps alleviate the curse of dimensionality by parameterizing large-scale multidimensional data via a set of network of low-rank tensors. Associated with such a construction is a notion of Tensor Train subspace and in this paper we propose a TT-PCA algorithm for estimating this structured subspace from the given data. By maintaining low rank tensor structure, TT-PCA is more robust to noise comparing with PCA or Tucker-PCA. This is borne out numerically by testing the proposed approach on the Extended YaleFace Dataset B.

LGDec 3, 2017

Tensor Train Neighborhood Preserving Embedding

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

In this paper, we propose a Tensor Train Neighborhood Preserving Embedding (TTNPE) to embed multi-dimensional tensor data into low dimensional tensor subspace. Novel approaches to solve the optimization problem in TTNPE are proposed. For this embedding, we evaluate novel trade-off gain among classification, computation, and dimensionality reduction (storage) for supervised learning. It is shown that compared to the state-of-the-arts tensor embedding methods, TTNPE achieves superior trade-off in classification, computation, and dimensionality reduction in MNIST handwritten digits and Weizmann face datasets.

MLAug 26, 2017

Faster Clustering via Non-Backtracking Random Walks

Brian Rappaport, Anuththari Gamage, Shuchin Aeron

This paper presents VEC-NBT, a variation on the unsupervised graph clustering technique VEC, which improves upon the performance of the original algorithm significantly for sparse graphs. VEC employs a novel application of the state-of-the-art word2vec model to embed a graph in Euclidean space via random walks on the nodes of the graph. In VEC-NBT, we modify the original algorithm to use a non-backtracking random walk instead of the normal backtracking random walk used in VEC. We introduce a modification to a non-backtracking random walk, which we call a begrudgingly-backtracking random walk, and show empirically that using this model of random walks for VEC-NBT requires shorter walks on the graph to obtain results with comparable or greater accuracy than VEC, especially for sparser graphs.

LGJul 23, 2017

Efficient Low Rank Tensor Ring Completion

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

Using the matrix product state (MPS) representation of the recently proposed tensor ring decompositions, in this paper we propose a tensor completion algorithm, which is an alternating minimization algorithm that alternates over the factors in the MPS representation. This development is motivated in part by the success of matrix completion algorithms that alternate over the (low-rank) factors. In this paper, we propose a spectral initialization for the tensor ring completion algorithm and analyze the computational complexity of the proposed algorithm. We numerically compare it with existing methods that employ a low rank tensor train approximation for data completion and show that our method outperforms the existing ones for a variety of real computer vision settings, and thus demonstrate the improved expressive power of tensor ring as compared to tensor train.

LGJun 18, 2017

Sample, computation vs storage tradeoffs for classification using tensor subspace models

Mohammadhossein Chaghazardi, Shuchin Aeron

In this paper, we exhibit the tradeoffs between the (training) sample, computation and storage complexity for the problem of supervised classification using signal subspace estimation. Our main tool is the use of tensor subspaces, i.e. subspaces with a Kronecker structure, for embedding the data into lower dimensions. Among the subspaces with a Kronecker structure, we show that using subspaces with a hierarchical structure for representing data leads to improved tradeoffs. One of the main reasons for the improvement is that embedding data into these hierarchical Kronecker structured subspaces prevents overfitting at higher latent dimensions.

MLApr 10, 2017

Word Embeddings via Tensor Factorization

Eric Bailey, Shuchin Aeron

Most popular word embedding techniques involve implicit or explicit factorization of a word co-occurrence based matrix into low rank factors. In this paper, we aim to generalize this trend by using numerical methods to factor higher-order word co-occurrence based arrays, or \textit{tensors}. We present four word embeddings using tensor factorization and analyze their advantages and disadvantages. One of our main contributions is a novel joint symmetric tensor factorization technique related to the idea of coupled tensor factorization. We show that embeddings based on tensor factorization can be used to discern the various meanings of polysemous words without being explicitly trained to do so, and motivate the intuition behind why this works in a way that doesn't with existing methods. We also modify an existing word embedding evaluation metric known as Outlier Detection [Camacho-Collados and Navigli, 2016] to evaluate the quality of the order-$N$ relations that a word embedding captures, and show that tensor-based methods outperform existing matrix-based methods at this task. Experimentally, we show that all of our word embeddings either outperform or are competitive with state-of-the-art baselines commonly used today on a variety of recent datasets. Suggested applications of tensor factorization-based word embeddings are given, and all source code and pre-trained vectors are publicly available online.

MLOct 15, 2016

Unsupervised clustering under the Union of Polyhedral Cones (UOPC) model

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

In this paper, we consider clustering data that is assumed to come from one of finitely many pointed convex polyhedral cones. This model is referred to as the Union of Polyhedral Cones (UOPC) model. Similar to the Union of Subspaces (UOS) model where each data from each subspace is generated from a (unknown) basis, in the UOPC model each data from each cone is assumed to be generated from a finite number of (unknown) \emph{extreme rays}.To cluster data under this model, we consider several algorithms - (a) Sparse Subspace Clustering by Non-negative constraints Lasso (NCL), (b) Least squares approximation (LSA), and (c) K-nearest neighbor (KNN) algorithm to arrive at affinity between data points. Spectral Clustering (SC) is then applied on the resulting affinity matrix to cluster data into different polyhedral cones. We show that on an average KNN outperforms both NCL and LSA and for this algorithm we provide the deterministic conditions for correct clustering. For an affinity measure between the cones it is shown that as long as the cones are not very coherent and as long as the density of data within each cone exceeds a threshold, KNN leads to accurate clustering. Finally, simulation results on real datasets (MNIST and YaleFace datasets) depict that the proposed algorithm works well on real data indicating the utility of the UOPC model and the proposed algorithm.

LGOct 5, 2016

Low-tubal-rank Tensor Completion using Alternating Minimization

Xiao-Yang Liu, Shuchin Aeron, Vaneet Aggarwal et al.

The low-tubal-rank tensor model has been recently proposed for real-world multidimensional data. In this paper, we study the low-tubal-rank tensor completion problem, i.e., to recover a third-order tensor by observing a subset of its elements selected uniformly at random. We propose a fast iterative algorithm, called {\em Tubal-Alt-Min}, that is inspired by a similar approach for low-rank matrix completion. The unknown low-tubal-rank tensor is represented as the product of two much smaller tensors with the low-tubal-rank property being automatically incorporated, and Tubal-Alt-Min alternates between estimating those two tensors using tensor least squares minimization. First, we note that tensor least squares minimization is different from its matrix counterpart and nontrivial as the circular convolution operator of the low-tubal-rank tensor model is intertwined with the sub-sampling operator. Second, the theoretical performance guarantee is challenging since Tubal-Alt-Min is iterative and nonconvex in nature. We prove that 1) Tubal-Alt-Min guarantees exponential convergence to the global optima, and 2) for an $n \times n \times k$ tensor with tubal-rank $r \ll n$, the required sampling complexity is $O(nr^2k \log^3 n)$ and the computational complexity is $O(n^2rk^2 \log^2 n)$. Third, on both synthetic data and real-world video data, evaluation results show that compared with tensor-nuclear norm minimization (TNN-ADMM), Tubal-Alt-Min improves the recovery error dramatically (by orders of magnitude). It is estimated that Tubal-Alt-Min converges at an exponential rate $10^{-0.4423 \text{Iter}}$ where $\text{Iter}$ denotes the number of iterations, which is much faster than TNN-ADMM's $10^{-0.0332 \text{Iter}}$, and the running time can be accelerated by more than $5$ times for a $200 \times 200 \times 20$ tensor.

LGSep 29, 2016

Algorithms for item categorization based on ordinal ranking data

Josh Girson, Shuchin Aeron

We present a new method for identifying the latent categorization of items based on their rankings. Complimenting a recent work that uses a Dirichlet prior on preference vectors and variational inference, we show that this problem can be effectively dealt with using existing community detection algorithms, with the communities corresponding to item categories. In particular we convert the bipartite ranking data to a unipartite graph of item affinities, and apply community detection algorithms. In this context we modify an existing algorithm - namely the label propagation algorithm to a variant that uses the distance between the nodes for weighting the label propagation - to identify the categories. We propose and analyze a synthetic ordinal ranking model and show its relation to the recently much studied stochastic block model. We test our algorithms on synthetic data and compare performance with several popular community detection algorithms. We also test the method on real data sets of movie categorization from the Movie Lens database. In all of the cases our algorithm is able to identify the categories for a suitable choice of tuning parameter.

NASep 19, 2016

Tensor Completion by Alternating Minimization under the Tensor Train (TT) Model

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

Using the matrix product state (MPS) representation of tensor train decompositions, in this paper we propose a tensor completion algorithm which alternates over the matrices (tensors) in the MPS representation. This development is motivated in part by the success of matrix completion algorithms which alternate over the (low-rank) factors. We comment on the computational complexity of the proposed algorithm and numerically compare it with existing methods employing low rank tensor train approximation for data completion as well as several other recently proposed methods. We show that our method is superior to existing ones for a variety of real settings.

ITJul 11, 2016

On Deterministic Conditions for Subspace Clustering under Missing Data

Wenqi Wang, Shuchin Aeron, Vaneet Aggarwal

In this paper we present deterministic conditions for success of sparse subspace clustering (SSC) under missing data, when data is assumed to come from a Union of Subspaces (UoS) model. We consider two algorithms, which are variants of SSC with entry-wise zero-filling that differ in terms of the optimization problems used to find affinity matrix for spectral clustering. For both the algorithms, we provide deterministic conditions for any pattern of missing data such that perfect clustering can be achieved. We provide extensive sets of simulation results for clustering as well as completion of data at missing entries, under the UoS model. Our experimental results indicate that in contrast to the full data case, accurate clustering does not imply accurate subspace identification and completion, indicating the natural order of relative hardness of these problems.

ITApr 15, 2016

On deterministic conditions for subspace clustering under missing data

Wenqi Wang, Shuchin Aeron, Vaneet Aggarwal

In this paper we present deterministic analysis of sufficient conditions for sparse subspace clustering under missing data, when data is assumed to come from a Union of Subspaces (UoS) model. In this context we consider two cases, namely Case I when all the points are sampled at the same co-ordinates, and Case II when points are sampled at different locations. We show that results for Case I directly follow from several existing results in the literature, while results for Case II are not as straightforward and we provide a set of dual conditions under which, perfect clustering holds true. We provide extensive set of simulation results for clustering as well as completion of data under missing entries, under the UoS model. Our experimental results indicate that in contrast to the full data case, accurate clustering does not imply accurate subspace identification and completion, indicating the natural order of relative hardness of these problems.