LGAug 30, 2024
Categorical data clustering: 25 years beyond K-modesTai Dinh, Wong Hauchi, Philippe Fournier-Viger et al.
The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
AIDec 25, 2024
Data clustering: a fundamental method in data science and managementTai Dinh, Wong Hauchi, Daniil Lisik et al.
This paper explores the critical role of data clustering in data science, emphasizing its methodologies, tools, and diverse applications. Traditional techniques, such as partitional and hierarchical clustering, are analyzed alongside advanced approaches such as data stream, density-based, graph-based, and model-based clustering for handling complex structured datasets. The paper highlights key principles underpinning clustering, outlines widely used tools and frameworks, introduces the workflow of clustering in data science, discusses challenges in practical implementation, and examines various applications of clustering. By focusing on these foundations and applications, the discussion underscores clustering's transformative potential. The paper concludes with insights into future research directions, emphasizing clustering's role in driving innovation and enabling data-driven decision-making.