Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

arXiv:2601.01162 | 0.23 | h-index: 6 | Has Code

AI Analysis: 85

This work addresses a core challenge in categorical data clustering for domains like healthcare and bioinformatics, offering a novel integration of external semantic knowledge to improve accuracy.

The paper tackles the problem of clustering categorical data by addressing the semantic gap in similarity measures, using Large Language Models to enhance representations and achieving improvements of 19-27% over existing methods on benchmark datasets.

Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. Specifically, an LLM is used to describe attribute values, which enhances their representations, and the resulting LLM-enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%. Code is available at https://github.com/develop-yang/ARISE
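The semantic gap described in the abstract can be illustrated with a minimal sketch. This is not the ARISE implementation: the attribute values, the two-dimensional "semantic" vectors (hand-picked stand-ins for embeddings of LLM-written value descriptions), and the fixed `weight` parameter (a simple substitute for learned attention weights) are all assumptions made for illustration. The sketch only shows that one-hot encoding makes all values equidistant, while appending a semantic vector lets related values sit closer together.

```python
import math

# Toy categorical attribute; the values are illustrative, not from the paper.
VALUES = ["mild", "moderate", "severe"]

# HYPOTHETICAL stand-ins for embeddings of LLM-generated value descriptions.
# A real pipeline would obtain these from a text-embedding model.
SEMANTIC = {
    "mild": [1.0, 0.0],
    "moderate": [0.8, 0.6],
    "severe": [0.0, 1.0],
}


def one_hot(value):
    """Encode a value as a one-hot vector: every pair of values is equidistant."""
    return [1.0 if v == value else 0.0 for v in VALUES]


def combined(value, weight=1.0):
    """Concatenate the one-hot code with a weighted semantic vector.

    `weight` is a fixed scalar here, assumed for illustration only.
    """
    return one_hot(value) + [weight * x for x in SEMANTIC[value]]


def distance(a, b, encode):
    """Euclidean distance between two values under a given encoding."""
    return math.dist(encode(a), encode(b))


d_onehot_near = distance("mild", "moderate", one_hot)
d_onehot_far = distance("mild", "severe", one_hot)
d_comb_near = distance("mild", "moderate", combined)
d_comb_far = distance("mild", "severe", combined)

print(f"one-hot:  mild-moderate={d_onehot_near:.3f}  mild-severe={d_onehot_far:.3f}")
print(f"combined: mild-moderate={d_comb_near:.3f}  mild-severe={d_comb_far:.3f}")
```

Under the one-hot encoding both pairs are the same distance apart, whereas under the combined encoding "mild" is closer to "moderate" than to "severe". Any distance-based clustering algorithm run on the combined vectors can then use that ordering.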
