Superior Parallel Big Data Clustering through Competitive Stochastic Sample Size Optimization in Big-means
This work addresses scalability issues for big data clustering applications, but it is incremental as it builds upon the existing Big-means methodology.
The paper tackles scalability and computation time challenges in K-means clustering for big data by introducing a parallel algorithm that dynamically adjusts sample sizes with competitive optimization, resulting in improved efficiency and clustering quality.
This paper introduces a novel K-means clustering algorithm that extends the conventional Big-means methodology. The proposed method integrates parallel processing, stochastic sampling, and competitive optimization into a scalable variant designed for big data applications, addressing the scalability and computation time challenges typically faced by traditional techniques. The algorithm adjusts the sample size of each worker dynamically during execution. Performance data gathered for each sample size is analyzed continually to identify the most efficient configuration. Competition among workers that use different sample sizes further improves the efficiency of Big-means. In essence, the algorithm balances computational time against clustering quality by employing a stochastic, competitive sampling strategy in a parallel computing setting.
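The strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation; it only shows one plausible reading of the idea under stated assumptions. The function names (`lloyd`, `worker_round`, `competitive_big_means`), the fixed list of candidate sample sizes, the uniform random choice of a size per worker, the use of threads, and the comparison of workers by mean squared distance on their own samples are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def lloyd(sample, centroids, n_iter=10):
    """Run plain Lloyd iterations on one sample.

    Returns the updated centroids and the mean squared distance of the
    sample points to their nearest centroid.
    """
    centroids = centroids.copy()
    for _ in range(n_iter):
        d2 = ((sample[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(len(centroids)):
            members = sample[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    d2 = ((sample[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.min(axis=1).mean()


def worker_round(data, centroids, size, seed):
    """One worker: draw a random sample of the given size and refine centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=size, replace=False)
    return lloyd(data[idx], centroids)


def competitive_big_means(data, k, candidate_sizes, n_workers=4, n_rounds=5, seed=0):
    """Hypothetical sketch: workers with different sample sizes compete.

    Assumption: a per-point (mean) objective makes results from samples of
    different sizes roughly comparable. This is a simplification.
    """
    rng = np.random.default_rng(seed)
    best_c = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    best_obj = np.inf
    scores = {int(s): [] for s in candidate_sizes}  # objective history per size
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_rounds):
            # Assumption: each worker picks its sample size uniformly at random.
            sizes = rng.choice(candidate_sizes, size=n_workers)
            seeds = rng.integers(0, 2**31 - 1, size=n_workers)
            futures = [
                pool.submit(worker_round, data, best_c, int(s), int(sd))
                for s, sd in zip(sizes, seeds)
            ]
            for s, fut in zip(sizes, futures):
                c, obj = fut.result()
                scores[int(s)].append(obj)
                if obj < best_obj:  # the winning worker sets the new incumbent
                    best_c, best_obj = c, obj
    return best_c, best_obj, scores


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    blobs = [rng.normal(loc=m, scale=0.1, size=(200, 2)) for m in (0.0, 5.0, 10.0)]
    centroids, objective, history = competitive_big_means(
        np.vstack(blobs), k=3, candidate_sizes=[50, 100, 200]
    )
    print(centroids.shape, sorted(history))
```

The `scores` dictionary stands in for the continual analysis of sample sizes mentioned in the abstract: a fuller version could use it to bias the choice of sizes in later rounds toward those that have performed well. Objectives measured on small samples are noisy, so in practice the comparison between workers may need a shared validation sample or another correction.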