An Efficient $k$-modes Algorithm for Clustering Categorical Datasets
This provides an improved algorithm for researchers and practitioners needing efficient clustering of categorical datasets, though it is incremental as it builds on the existing k-modes method.
The paper tackles the problem of clustering categorical data by introducing OTQT, a novel and computationally efficient implementation of the k-modes algorithm, which is proven to find undetectable updates and is more accurate per iteration, often faster to the optimum than existing methods.
Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes optimization.