LG, Nov 9, 2023
Hard-Negative Sampling for Contrastive Learning: Optimal Representation Geometry and Neural- vs Dimensional-Collapse
Ruijie Jiang, Thuan Nguyen, Shuchin Aeron et al.
For a widely-studied data model and general loss and sample-hardening functions we prove that the losses of Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) are minimized by representations that exhibit Neural-Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) losses are lower bounded by the corresponding SCL and UCL losses. In contrast to existing literature, our theoretical results for SCL do not require class-conditional independence of augmented views and work for a general loss function class that includes the widely used InfoNCE loss function. Moreover, our proofs are simpler, compact, and transparent. Similar to existing literature, our theoretical claims also hold for the practical scenario where batching is used for optimization. We empirically demonstrate, for the first time, that Adam optimization (with batching) of HSCL and HUCL losses with random initialization and suitable hardness levels can indeed converge to the NC-geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard-negatives or feature normalization, however, the representations learned via Adam suffer from Dimensional-Collapse (DC) and fail to attain the NC-geometry. These results exemplify the role of hard-negative sampling in contrastive representation learning and we conclude with several open theoretical problems for future work. The code can be found at https://github.com/rjiang03/HCL/tree/main
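The Neural-Collapse geometry described above can be made concrete with a small numerical check. The following is a minimal sketch, not taken from the paper's code: it builds one standard simplex ETF construction (centered and rescaled basis vectors) and prints the Gram matrix of the class means. The function names `simplex_etf` and `dot` and the choice K = 4 are illustrative assumptions.

```python
import math

def simplex_etf(num_classes):
    """Return num_classes unit vectors in R^num_classes that form a simplex ETF."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Center each standard basis vector e_i by subtracting the mean of all
    # basis vectors, then rescale so that every vector has unit norm.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

K = 4  # illustrative number of classes
means = simplex_etf(K)
gram = [[dot(u, v) for v in means] for u in means]

# Diagonal entries are 1 (unit norm); off-diagonal entries all equal -1/(K-1).
for row in gram:
    print(" ".join(f"{x:+.3f}" for x in row))
```

In this construction every pair of distinct class means has the same inner product, -1/(K-1), and the means sum to zero, which is the equiangular, maximally separated arrangement that the abstract refers to as an ETF.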
LG, Aug 31, 2022
Supervised Contrastive Learning with Hard Negative Samples
Ruijie Jiang, Thuan Nguyen, Prakash Ishwar et al.
Through minimization of an appropriate loss function such as the InfoNCE loss, contrastive learning (CL) learns a useful representation function by pulling positive samples close to each other while pushing negative samples far apart in the embedding space. The positive samples are typically created using "label-preserving" augmentations, i.e., domain-specific transformations of a given datum or anchor. In the absence of class information, in unsupervised CL (UCL), the negative samples are typically chosen randomly and independently of the anchor from a preset negative sampling distribution over the entire dataset. This leads to class collisions in UCL. Supervised CL (SCL) avoids this class collision by conditioning the negative sampling distribution on samples having labels different from that of the anchor. In hard-UCL (H-UCL), which has been shown to be an effective method to further enhance UCL, the negative sampling distribution is conditionally tilted, by means of a hardening function, towards samples that are closer to the anchor. Motivated by this, in this paper we propose hard-SCL (H-SCL), wherein the class-conditional negative sampling distribution is tilted via a hardening function. Our simulation results confirm the utility of H-SCL over SCL with significant performance gains in downstream classification tasks. Analytically, we show that in the limit of infinitely many negative samples per anchor and under a suitable assumption, the H-SCL loss is upper bounded by the H-UCL loss, thereby justifying the utility of H-UCL for controlling the H-SCL loss in the absence of label information. Through experiments on several datasets, we verify the assumption as well as the claimed inequality between H-UCL and H-SCL losses. We also provide a plausible scenario where the H-SCL loss is lower bounded by the UCL loss, indicating the limited utility of UCL in controlling the H-SCL loss.
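The tilting of the negative sampling distribution can be illustrated for a single anchor. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an exponential hardening function exp(beta * similarity), which is one common choice, and the function name `info_nce`, the temperature, and the similarity values are all made up for illustration.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.5, beta=0.0):
    """InfoNCE-style loss for one anchor with exponentially tilted negatives.

    beta = 0 gives uniform negative weights (plain InfoNCE);
    beta > 0 shifts weight toward negatives more similar to the anchor.
    The exponential hardening function is an assumption of this sketch.
    """
    n = len(neg_sims)
    # Hardening weights proportional to exp(beta * similarity), normalized to sum to 1.
    raw = [math.exp(beta * s) for s in neg_sims]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Weighted negative term, scaled by n so that beta = 0 recovers the plain sum.
    neg_term = n * sum(
        w * math.exp(s / temperature) for w, s in zip(weights, neg_sims)
    )
    pos_term = math.exp(pos_sim / temperature)
    return -math.log(pos_term / (pos_term + neg_term))

pos = 0.8                       # similarity between anchor and positive (made up)
negs = [0.6, 0.1, -0.3, -0.7]   # similarities between anchor and negatives (made up)
plain = info_nce(pos, negs, beta=0.0)
hard = info_nce(pos, negs, beta=2.0)
print(f"plain InfoNCE: {plain:.4f}, hardened: {hard:.4f}")
```

Because the tilted weights put more mass on negatives with larger similarity, the hardened negative term is never smaller than the uniform one when beta is nonnegative, so the hardened loss in this sketch is at least the plain loss.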
LG, Aug 1, 2022
Joint covariate-alignment and concept-alignment: a framework for domain generalization
Thuan Nguyen, Boyang Lyu, Prakash Ishwar et al.
In this paper, we propose a novel domain generalization (DG) framework based on a new upper bound to the risk on the unseen domain. Particularly, our framework proposes to jointly minimize both the covariate-shift as well as the concept-shift between the seen domains for a better performance on the unseen domain. While the proposed approach can be implemented via an arbitrary combination of covariate-alignment and concept-alignment modules, in this work we use well-established approaches for distributional alignment namely, Maximum Mean Discrepancy (MMD) and covariance Alignment (CORAL), and use an Invariant Risk Minimization (IRM)-based approach for concept alignment. Our numerical results show that the proposed methods perform as well as or better than the state-of-the-art for domain generalization on several data sets.
LG, Apr 2, 2023
A principled approach to model validation in domain generalization
Boyang Lyu, Thuan Nguyen, Matthias Scheutz et al.
Domain generalization aims to learn a model with good generalization ability, that is, the learned model should not only perform well on several seen domains but also on unseen domains with different data distributions. State-of-the-art domain generalization methods typically train a representation function followed by a classifier jointly to minimize both the classification risk and the domain discrepancy. However, when it comes to model selection, most of these methods rely on traditional validation routines that select models solely based on the lowest classification risk on the validation set. In this paper, we theoretically demonstrate a trade-off between minimizing classification risk and mitigating domain discrepancy, i.e., it is impossible to achieve the minimum of these two objectives simultaneously. Motivated by this theoretical result, we propose a novel model selection method suggesting that the validation process should account for both the classification risk and the domain discrepancy. We validate the effectiveness of the proposed method by numerical results on several domain generalization datasets.
LG, Oct 26, 2022
Trade-off between reconstruction loss and feature alignment for domain generalization
Thuan Nguyen, Boyang Lyu, Prakash Ishwar et al.
Domain generalization (DG) is a branch of transfer learning that aims to train the learning models on several seen domains and subsequently apply these pre-trained models to other unseen (unknown but related) domains. To deal with challenging settings in DG where both data and label of the unseen domain are not available at training time, the most common approach is to design the classifiers based on the domain-invariant representation features, i.e., the latent representations that are unchanged and transferable between domains. Contrary to popular belief, we show that designing classifiers based on invariant representation features alone is necessary but insufficient in DG. Our analysis indicates the necessity of imposing a constraint on the reconstruction loss induced by representation functions to preserve most of the relevant information about the label in the latent space. More importantly, we point out the trade-off between minimizing the reconstruction loss and achieving domain alignment in DG. Our theoretical results motivate a new DG framework that jointly optimizes the reconstruction loss and the domain discrepancy. Both theoretical and numerical results are provided to justify our approach.
LG, May 11
Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets
Thuan Nguyen, Shuchin Aeron, D. Richard Brown et al.
In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well-known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.
LG, Jan 25, 2022
Conditional entropy minimization principle for learning domain invariant representation features
Thuan Nguyen, Boyang Lyu, Prakash Ishwar et al.
Invariance-principle-based methods such as Invariant Risk Minimization (IRM) have recently emerged as promising approaches for Domain Generalization (DG). Despite promising theory, such approaches fail in common classification tasks due to the mixing of true invariant features and spurious invariant features. To address this, we propose a framework based on the conditional entropy minimization (CEM) principle to filter out the spurious invariant features, leading to a new algorithm with a better generalization capability. We show that our proposed approach is closely related to the well-known Information Bottleneck (IB) framework and prove that, under certain assumptions, entropy minimization can exactly recover the true invariant features. Our approach provides competitive classification accuracy compared to recent theoretically-principled state-of-the-art alternatives across several DG datasets.
LG, Sep 4, 2021
Barycentric-alignment and reconstruction loss minimization for domain generalization
Boyang Lyu, Thuan Nguyen, Prakash Ishwar et al.
This paper advances the theory and practice of Domain Generalization (DG) in machine learning. We consider the typical DG setting where the hypothesis is composed of a representation mapping followed by a labeling function. Within this setting, the majority of popular DG methods aim to jointly learn the representation and the labeling functions by minimizing a well-known upper bound for the classification risk in the unseen domain. In practice, however, methods based on this theoretical upper bound ignore a term that cannot be directly optimized due to its dual dependence on both the representation mapping and the unknown optimal labeling function in the unseen domain. To bridge this gap between theory and practice, we introduce a new upper bound that is free of terms having such dual dependence, resulting in a fully optimizable risk upper bound for the unseen domain. Our derivation leverages classical and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Compared to previous bounds, our bound introduces two new terms: (i) the Wasserstein-2 barycenter term that aligns distributions between domains, and (ii) the reconstruction loss term that assesses the quality of representation in reconstructing the original data. Based on this new upper bound, we propose a novel DG algorithm named Wasserstein Barycenter Auto-Encoder (WBAE) that simultaneously minimizes the classification loss, the barycenter loss, and the reconstruction loss. Numerical results demonstrate that the proposed method outperforms current state-of-the-art DG algorithms on several datasets.
SP, Jan 7, 2020
On the Uniqueness of Binary Quantizers for Maximizing Mutual Information
Thuan Nguyen, Thinh Nguyen
We consider a channel with a binary input X being corrupted by a continuous-valued noise that results in a continuous-valued output Y. An optimal binary quantizer is used to quantize the continuous-valued output Y to the final binary output Z to maximize the mutual information I(X; Z). We show that when the ratio of the channel conditional densities r(y) = P(Y=y|X=0)/P(Y=y|X=1) is a strictly increasing or strictly decreasing function of y, a quantizer having a single threshold can maximize mutual information. Furthermore, we show that an optimal quantizer (possibly with multiple thresholds) is the one with the thresholding vector whose elements are all the solutions of r(y) = r* for some constant r* > 0. Interestingly, the optimal constant r* is unique. This uniqueness property allows for fast algorithmic implementation, such as a bisection algorithm, to find the optimal quantizer. Our results also confirm some previous results using alternative elementary proofs. We show some numerical examples of applying our results to channels with additive Gaussian noise.
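The single-threshold case can be explored numerically. The following is a minimal sketch under stated assumptions, not the paper's bisection algorithm: it assumes X = 0 is sent as -1 and X = 1 as +1 over additive Gaussian noise (so r(y) is strictly decreasing in y), and it finds the threshold by a plain grid search. The names `q_func`, `binary_entropy`, and `mutual_information`, the noise level, and the grid are illustrative choices.

```python
import math

def q_func(x):
    """Gaussian tail probability P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def mutual_information(threshold, sigma=1.0, p0=0.5):
    """I(X; Z) in bits for Z = 1 if Y > threshold, else 0.

    Assumed channel: Y = -1 + noise when X = 0, Y = +1 + noise when X = 1,
    with noise ~ N(0, sigma^2) and P(X = 0) = p0.
    """
    a = q_func((threshold + 1.0) / sigma)  # P(Z = 1 | X = 0)
    b = q_func((threshold - 1.0) / sigma)  # P(Z = 1 | X = 1)
    p_z1 = p0 * a + (1.0 - p0) * b
    # I(X; Z) = H(Z) - H(Z | X)
    return binary_entropy(p_z1) - (
        p0 * binary_entropy(a) + (1.0 - p0) * binary_entropy(b)
    )

# Grid search over candidate thresholds in [-3, 3] with step 0.01.
grid = [-3.0 + 0.01 * i for i in range(601)]
best_t = max(grid, key=mutual_information)
best_mi = mutual_information(best_t)
print(f"best threshold ~ {best_t:.2f}, I(X; Z) ~ {best_mi:.4f} bits")
```

With equal priors and this symmetric channel, the search settles near a threshold of zero, as symmetry suggests. A bisection on r* as described in the abstract would replace the grid search in a faster implementation.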
IT, Jan 6, 2020
Communication-Channel Optimized Partition
Thuan Nguyen, Thinh Nguyen
Consider an original discrete source X with distribution p_X that is corrupted by noise to produce the noisy data Y with a given joint distribution p(X, Y). A quantizer/classifier Q : Y -> Z is then used to classify/quantize the data Y to the discrete partitioned output Z with probability distribution p_Z. Next, Z is transmitted over a deterministic channel with a given channel matrix A that produces the final discrete output T. One wants to design the optimal quantizer/classifier Q^* such that the cost function F(X; T) between the input X and the final output T is minimized while the probability of the partitioned output Z satisfies a concave constraint G(p_Z) < C. Our results generalize several well-known previous results. First, an iterative algorithm with linear time complexity is proposed to find a locally optimal quantizer. Second, we show that the optimal partition should be a hard partition that is equivalent to cuts by hyperplanes in the probability space of the posterior probability p(X|Y). This result in turn yields a polynomial-time algorithm to find the globally optimal quantizer.
IT, Dec 31, 2019
Minimizing Impurity Partition Under Constraints
Thuan Nguyen, Thinh Nguyen
Set partitioning is a key component of many algorithms in machine learning, signal processing, and communications. In general, the problem of finding a partition that minimizes a given impurity (loss function) is NP-hard. As such, there exists a wealth of literature on approximate algorithms and theoretical analyses of the partitioning problem under different settings. In this paper, we formulate and solve a variant of the partition problem called the minimum impurity partition under constraint (MIPUC). MIPUC finds an optimal partition that minimizes a given loss function under a given concave constraint. MIPUC generalizes the recently proposed deterministic information bottleneck problem, which finds an optimal partition that maximizes the mutual information between the input and partition output while minimizing the partition output entropy. Our proposed algorithm is based on a novel optimality condition, which allows us to find a locally optimal solution efficiently. Moreover, we show that the optimal partition is a hard partition that is equivalent to cuts by hyperplanes in the probability space of the posterior probability, which in turn yields a polynomial-time algorithm to find the globally optimal partition. Both theoretical and numerical results are provided to validate the proposed algorithm.