Clustering of Big Data with Mixed Features
The paper addresses a central problem in data mining for big data applications, but the contribution appears incremental because it builds on the existing peak-finding technique.
The paper tackles the problem of clustering large datasets with mixed data types. Existing approaches based on k-means are sensitive to initialization, detect only spherical clusters, and require the number of clusters to be known in advance. The result is a new algorithm that improves the applicability and efficiency of the peak-finding technique: it handles mixed data, detects outliers and lower-density clusters, and determines the number of clusters. The authors support these claims with experimental results.
Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means, and hence are sensitive to initialisation, detect only spherical clusters, and require the number of clusters, which is generally unknown, to be specified a priori. We develop a new clustering algorithm for large data of mixed type, aiming to improve the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) the algorithm is capable of detecting outliers and clusters of relatively lower density; (3) the algorithm is able to determine the correct number of clusters. The computational complexity of the algorithm is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that our algorithm works well in practice.
Keywords: Clustering; Big Data; Mixed Attribute; Density Peaks; Nearest-Neighbor Graph; Conductance.
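To make the peak-finding idea concrete, the following is a minimal sketch of density-peak clustering on mixed records, not the authors' algorithm. It rests on several assumptions that do not come from the paper: the names `Record`, `mixed_distance`, and `density_peaks` are hypothetical; the distance is a Gower-style average of range-scaled numeric gaps and categorical mismatches; density is taken as the inverse mean distance to the k nearest neighbours; neighbours are found by brute force rather than by a fast k-nearest neighbors method; and the number of clusters is passed in as `n_clusters` instead of being determined by the algorithm.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (numeric features, categorical features).
Record = Tuple[Tuple[float, ...], Tuple[str, ...]]


def mixed_distance(a: Record, b: Record, ranges: Sequence[float]) -> float:
    """Gower-style distance (an assumption, not the paper's measure)."""
    num_a, cat_a = a
    num_b, cat_b = b
    total = 0.0
    for x, y, r in zip(num_a, num_b, ranges):
        total += abs(x - y) / r if r > 0 else 0.0  # range-scaled numeric gap
    total += sum(1.0 for u, v in zip(cat_a, cat_b) if u != v)  # mismatches
    return total / (len(num_a) + len(cat_a))


def density_peaks(data: Sequence[Record], k: int, n_clusters: int) -> List[int]:
    """Label each record by following its nearest higher-density neighbour."""
    n = len(data)
    n_num = len(data[0][0])
    ranges = [
        max(r[0][j] for r in data) - min(r[0][j] for r in data)
        for j in range(n_num)
    ]
    # Brute-force O(n^2) distance matrix; fine only for small inputs.
    dist = [[mixed_distance(data[i], data[j], ranges) for j in range(n)]
            for i in range(n)]

    # kNN density: inverse of the mean distance to the k nearest neighbours.
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(1.0 / (sum(nearest) / k + 1e-12))

    # Process points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest higher-density point
    parent = [-1] * n   # index of that point
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # densest point: use its largest distance
            continue
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i] = dist[i][j]
        parent[i] = j

    # Peaks are the points with the largest rho * delta; the stable sort keeps
    # the densest point first among ties, so it is always selected.
    peaks = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, p in enumerate(peaks):
        labels[p] = c
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]  # parent was labelled earlier
    return labels


if __name__ == "__main__":
    sample = [
        ((1.0, 1.1), ("a",)), ((1.2, 0.9), ("a",)), ((0.9, 1.0), ("a",)),
        ((8.0, 8.2), ("b",)), ((8.1, 7.9), ("b",)), ((7.9, 8.0), ("b",)),
    ]
    print(density_peaks(sample, k=2, n_clusters=2))
```

Because each non-peak point inherits the label of its nearest denser neighbour, no initial centroids are needed and clusters are not forced to be spherical, which is the contrast with k-means that the abstract draws. The quadratic distance matrix is the part a fast k-nearest neighbors method would replace.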