GPU-Accelerated Optimizer-Aware Evaluation of Submodular Exemplar Clustering
This work addresses the real-time performance bottleneck in exemplar-based clustering for practical applications, representing an incremental improvement in computational efficiency.
The paper tackled the high computational complexity of submodular exemplar clustering by developing a GPU-accelerated algorithm that reduces wall-clock run-time, achieving speedups of up to 72x compared to multi-threaded CPU computations and up to 452x for half-precision GPU vs. single-thread CPU.
The optimization of submodular functions constitutes a viable way to perform clustering. Strong approximation guarantees and feasible optimization w.r.t. streaming data make this clustering approach favorable. Technically, submodular functions map subsets of data to real values, which indicate how "representative" a specific subset is. Optimal sets might then be used to partition the data space and to infer clusters. Exemplar-based clustering is one of the possible submodular functions, but suffers from high computational complexity. However, for practical applications, the particular real-time or wall-clock run-time is decisive. In this work, we present a novel way to evaluate this particular function on GPUs, which keeps the necessities of optimizers in mind and reduces wall-clock run-time. To discuss our GPU algorithm, we investigated both the impact of different run-time critical problem properties, like data dimensionality and the number of data points in a subset, and the influence of required floating-point precision. In reproducible experiments, our GPU algorithm was able to achieve competitive speedups of up to 72x depending on whether multi-threaded computation on CPUs was used for comparison and the type of floating-point precision required. Half-precision GPU computation led to large speedups of up to 452x compared to single-precision, single-thread CPU computations.