Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering
For practitioners analyzing tabular data in domains like finance and healthcare, this work addresses a real limitation: existing clustering methods ignore feature semantics. The improvement is incremental, however, because it combines existing techniques (LLMs, contrastive learning, clustering).
Existing deep clustering methods for tabular data rely on statistical co-occurrence and ignore the semantic knowledge in feature names and values, which leaves conceptually related samples isolated from one another. TagCC uses LLMs to create textual anchors and applies contrastive learning to enrich tabular representations with open-world semantics, significantly outperforming baselines on benchmarks.
Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as unrelated symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
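The abstract does not spell out the loss functions, so the following is only a minimal sketch of how a contrastive term and a clustering term might be optimized jointly. It assumes a symmetric InfoNCE loss that pairs each row's tabular embedding with the embedding of its textual anchor, and a DEC-style KL clustering loss with a Student-t kernel; both choices, along with every name here (`info_nce`, `soft_assignments`, `clustering_kl`, `joint_loss`, the weight `lam`, the `temperature`), are illustrative assumptions rather than TagCC's actual formulation. Random vectors stand in for the encoder and LLM outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(tab_emb, anchor_emb, temperature=0.1):
    # Symmetric InfoNCE: row i of tab_emb is the positive for row i of anchor_emb;
    # all other rows in the batch act as negatives.
    z = l2_normalize(tab_emb)
    a = l2_normalize(anchor_emb)
    logits = z @ a.T / temperature
    idx = np.arange(len(z))
    loss_tab_to_anchor = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_anchor_to_tab = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_tab_to_anchor + loss_anchor_to_tab))

def soft_assignments(z, centroids):
    # Student-t kernel (one degree of freedom) between samples and centroids,
    # normalized so each row is a distribution over clusters.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def clustering_kl(q, eps=1e-12):
    # Sharpened target distribution p, then mean KL(p || q) over samples.
    w = q ** 2 / q.sum(axis=0, keepdims=True)
    p = w / w.sum(axis=1, keepdims=True)
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean())

def joint_loss(tab_emb, anchor_emb, centroids, lam=1.0, temperature=0.1):
    # Assumed joint objective: contrastive alignment plus weighted clustering loss.
    q = soft_assignments(tab_emb, centroids)
    return info_nce(tab_emb, anchor_emb, temperature) + lam * clustering_kl(q)

# Stand-in data: 8 rows, 16-dimensional embeddings, 3 clusters.
rng = np.random.default_rng(0)
tab_emb = rng.normal(size=(8, 16))                          # hypothetical tabular encoder output
anchor_emb = tab_emb + 0.1 * rng.normal(size=(8, 16))       # hypothetical text-anchor embeddings
centroids = rng.normal(size=(3, 16))
loss = joint_loss(tab_emb, anchor_emb, centroids)
```

In a real training loop these terms would be computed in an autodiff framework and backpropagated through the tabular encoder; the NumPy version only shows how the two objectives fit together in a single scalar.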