LG | Jan 2, 2023
Deep Clustering of Tabular Data by Weighted Gaussian Distribution Learning
Shourav B. Rabbani, Ivan V. Medri, Manar D. Samad
Deep learning methods are primarily proposed for supervised learning of images or text, with limited applications to clustering problems. In contrast, tabular data with heterogeneous features pose unique challenges in representation learning, where deep learning has yet to replace traditional machine learning. This paper addresses these challenges by developing one of the first deep clustering methods for tabular data: Gaussian Cluster Embedding in Autoencoder Latent Space (G-CEALS). G-CEALS is an unsupervised deep clustering framework that learns the parameters of multivariate Gaussian cluster distributions by iteratively updating individual cluster weights. On sixteen tabular data sets, G-CEALS achieves average ranks of 2.9 (1.7) and 2.8 (1.7) based on clustering accuracy and adjusted Rand index (ARI) scores, respectively, and outperforms nine state-of-the-art clustering methods. G-CEALS substantially improves clustering performance compared to traditional K-means and GMM, which are still the de facto methods for clustering tabular data. Similar computationally efficient and high-performing deep clustering frameworks are needed to bring the benefits of deep learning to tabular data, where traditional machine learning still dominates.
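The two scores named in this abstract, clustering accuracy and ARI, can be computed from label vectors alone. The sketch below is illustrative only and is not the authors' evaluation code: the function names are hypothetical, the best cluster-to-class mapping is found by brute force (workable only for a handful of clusters, whereas the Hungarian algorithm is the usual choice), and at least two samples are assumed.

```python
from collections import Counter
from itertools import permutations
from math import comb


def clustering_accuracy(y_true, y_pred):
    """Fraction of samples labeled correctly under the most favorable
    one-to-one mapping of cluster ids to class ids."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with dummy classes so every cluster can be mapped to something.
    k = max(len(classes), len(clusters))
    classes = classes + [None] * (k - len(classes))
    counts = Counter(zip(y_pred, y_true))
    best = 0
    # Brute-force search over injective mappings (assumption: small k).
    for perm in permutations(classes, len(clusters)):
        hits = sum(counts[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, hits)
    return best / len(y_true)


def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand index computed from the contingency table counts."""
    n = len(y_true)  # assumed to be at least 2
    sum_cells = sum(comb(v, 2) for v in Counter(zip(y_true, y_pred)).values())
    sum_rows = sum(comb(v, 2) for v in Counter(y_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(y_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy labels, made up for illustration.
    truth = [0, 0, 0, 1, 1, 1]
    found = [1, 1, 0, 0, 0, 0]
    print(clustering_accuracy(truth, found))
    print(adjusted_rand_index(truth, found))
```

Both scores are invariant to how the cluster ids are numbered, which is why they suit unsupervised methods whose output labels are arbitrary.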
64.6 | OC | Mar 30
Gromov-Wasserstein Barycenters: The Analysis Problem
Rocío Díaz Martín, Ivan V. Medri, James M. Murphy
This paper considers the problem of estimating a matrix that encodes pairwise distances in a finite metric space (or, more generally, the edge weight matrix of a network) under the barycentric coding model (BCM) with respect to the Gromov-Wasserstein (GW) distance function. We frame this task as estimating the unknown barycentric coordinates with respect to the GW distance, assuming that the target matrix (or kernel) belongs to the set of GW barycenters of a finite collection of known templates. In the language of harmonic analysis, if computing GW barycenters can be viewed as a synthesis problem, this paper aims to solve the corresponding analysis problem. We propose two methods: one utilizing fixed-point iteration for computing GW barycenters, and another employing a differentiation-based approach to the GW structure using a blow-up technique. Finally, we demonstrate the application of the proposed GW analysis approach in a series of numerical experiments and applications to machine learning.
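For orientation, the synthesis and analysis problems can be written side by side. The notation below is assumed for illustration, not taken from the abstract: $C_s$ and $p_s$ are the template matrices and their weight vectors, $\lambda$ is a point of the probability simplex $\Delta_{S-1}$, and the squared-loss form of the GW discrepancy is used, which is a common choice but may differ from the paper's exact setup.

```latex
% Squared-loss Gromov-Wasserstein discrepancy between (C, p) and (C', q)
\mathrm{GW}^2(C, C'; p, q) \;=\; \min_{T \in \Pi(p, q)}
  \sum_{i,j,k,l} \bigl| C_{ik} - C'_{jl} \bigr|^2 \, T_{ij} \, T_{kl}

% Synthesis: given coordinates \lambda, compute a barycenter of the templates
C(\lambda) \;\in\; \operatorname*{arg\,min}_{C}
  \sum_{s=1}^{S} \lambda_s \, \mathrm{GW}^2(C, C_s; p, p_s),
  \qquad \lambda \in \Delta_{S-1}

% Analysis: given a matrix C assumed to be of the form C(\lambda),
% recover the coordinates \lambda \in \Delta_{S-1}
```

Here $\Pi(p, q)$ denotes the set of couplings with marginals $p$ and $q$. The analysis problem inverts the map $\lambda \mapsto C(\lambda)$ on the set of barycenters.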
LG | Jan 8, 2024
Attention versus Contrastive Learning of Tabular Data -- A Data-centric Benchmarking
Shourav B. Rabbani, Ivan V. Medri, Manar D. Samad
Despite groundbreaking success in image and text learning, deep learning has not achieved significant improvements over traditional machine learning (ML) on tabular data. This performance gap underscores the need for data-centric treatment and benchmarking of learning algorithms. Recently, breakthroughs in attention and contrastive learning have shifted paradigms in computer vision and natural language processing. However, the effectiveness of these advanced deep models on tabular data has been studied only sparsely, on a few data sets with very large sample sizes, with mixed findings reported after benchmarking against a limited number of baselines. We argue that the heterogeneity of tabular data sets and the selective choice of baselines in the literature can bias benchmarking outcomes. This article extensively evaluates state-of-the-art attention and contrastive learning methods on a wide selection of 28 tabular data sets (14 easy and 14 hard-to-classify) against traditional deep and machine learning. Our data-centric benchmarking demonstrates when traditional ML is preferred over deep learning and vice versa, because no single learning method is best for all tabular data sets. Combining between-sample and between-feature attention outperforms the otherwise dominant traditional ML on tabular data sets by a significant margin but fails on high-dimensional data, where contrastive learning takes a robust lead. While a hybrid attention-contrastive learning strategy mostly wins on hard-to-classify data sets, traditional methods are frequently superior on easy-to-classify data sets with presumably simpler decision boundaries. To the best of our knowledge, this is the first benchmarking paper with statistical analyses of attention and contrastive learning performance on a diverse selection of tabular data sets against traditional deep and machine learning baselines, intended to facilitate further advances in this field.