LG IT ML | Oct 23, 2020

Quantizing Multiple Sources to a Common Cluster Center: An Asymptotic Analysis

arXiv:2010.12546v1

Originality: Incremental advance

AI Analysis

This paper addresses the problem of improving clustering accuracy for noisy data in machine learning. The advance is incremental because it builds on existing center-based clustering methods.

The paper tackles the problem of clustering datasets with multiple noisy observations per member by quantizing concatenated vectors to a common cluster center. It derives an asymptotic formula for the average distortion and provides an algorithm for numerically optimizing the cluster centers, with the analytical results verified on real and artificial datasets. In terms of faithfulness to the original (noiseless) dataset, this approach outperforms naive quantization of the concatenated observations.

We consider quantizing an $Ld$-dimensional sample, which is obtained by concatenating $L$ vectors from datasets of $d$-dimensional vectors, to a $d$-dimensional cluster center. The distortion measure is the weighted sum of $r$th powers of the distances between the cluster center and the samples. For $L=1$, one recovers the ordinary center-based clustering formulation. The general case $L>1$ appears when one wishes to cluster a dataset through $L$ noisy observations of each of its members. We find a formula for the average distortion performance in the asymptotic regime where the number of cluster centers is large. We also provide an algorithm to numerically optimize the cluster centers and verify our analytical results on real and artificial datasets. In terms of faithfulness to the original (noiseless) dataset, our clustering approach outperforms the naive approach that relies on quantizing the $Ld$-dimensional noisy observation vectors to $Ld$-dimensional centers.
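
The formulation above can be illustrated with a small Python sketch. It assumes the distortion of a sample $(x_1, \dots, x_L)$ at a center $c$ is $\sum_{\ell=1}^{L} w_\ell \|x_\ell - c\|^r$ with one weight $w_\ell$ per observation, and it runs a Lloyd-style alternating optimization for the special case $r=2$, where the best center of a cell is the weighted mean of all observations assigned to it. This is not necessarily the algorithm used in the paper, whose details are not given in the abstract. The function names `distortion` and `lloyd_common_center`, the `(N, L, d)` array layout (a reshaped view of the $Ld$-dimensional samples), and the random initialization are assumptions made for illustration.

```python
import numpy as np


def distortion(samples, centers, weights, r=2.0):
    """Cost of quantizing each sample to each center.

    samples: (N, L, d) array, L observations of d-dimensional vectors.
    centers: (K, d) array of common cluster centers.
    weights: (L,) array, one weight per observation (assumed form).
    Returns an (N, K) array of weighted sums of r-th powers of distances.
    """
    # Pairwise differences between every observation and every center.
    diff = samples[:, None, :, :] - centers[None, :, None, :]  # (N, K, L, d)
    dist = np.linalg.norm(diff, axis=-1)                       # (N, K, L)
    return (dist ** r) @ weights                               # (N, K)


def lloyd_common_center(samples, k, weights, n_iter=50, seed=0):
    """Lloyd-style optimization of k common centers for r = 2 only."""
    rng = np.random.default_rng(seed)
    n = samples.shape[0]
    # Initialize from the first observation of k random samples (assumption).
    centers = samples[rng.choice(n, size=k, replace=False), 0, :].copy()
    # Weighted average of the L observations of each sample: (N, d).
    wbar = np.tensordot(samples, weights, axes=([1], [0])) / weights.sum()
    for _ in range(n_iter):
        # Assignment step: nearest center under the weighted distortion.
        labels = distortion(samples, centers, weights, r=2.0).argmin(axis=1)
        # Update step: for r = 2 the minimizer is the mean of wbar in the cell.
        for j in range(k):
            members = labels == j
            if members.any():  # leave an empty cell's center unchanged
                centers[j] = wbar[members].mean(axis=0)
    labels = distortion(samples, centers, weights, r=2.0).argmin(axis=1)
    return centers, labels
```

Calling `lloyd_common_center(noisy, k, np.ones(L))` on an array `noisy` of shape `(N, L, d)` returns `k` centers in $d$ dimensions, in contrast to the naive approach of clustering the flattened `(N, L*d)` array into $Ld$-dimensional centers. For $r \neq 2$ the update step has no closed form in general and would need a numerical minimizer instead.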
