Large Scale Distributed Distance Metric Learning
This work addresses the scalability problem for DML in machine learning and data mining, enabling its application to modern big data scenarios.
The paper tackles the prohibitive computational cost of Distance Metric Learning (DML) for high-dimensional, large-scale data by presenting a distributed algorithm and implementation on a parameter server architecture, achieving completion of a task on a dataset with 1 million points, 22,000 features, and 200 million labeled pairs in 15 hours on 256 CPU cores.
In large scale machine learning and data mining problems with high feature dimensionality, the Euclidean distance between data points can be uninformative, and Distance Metric Learning (DML) is often desired to learn a proper similarity measure (using side information such as example data pairs being similar or dissimilar). However, high dimensionality and large volume of pairwise constraints in modern big data can lead to prohibitive computational cost for both the original DML formulation in Xing et al. (2002) and later extensions. In this paper, we present a distributed algorithm for DML, and a large-scale implementation on a parameter server architecture. Our approach builds on a parallelizable reformulation of Xing et al. (2002), and an asynchronous stochastic gradient descent optimization procedure. To our knowledge, this is the first distributed solution to DML, and we show that, on a system with 256 CPU cores, our program is able to complete a DML task on a dataset with 1 million data points, 22-thousand features, and 200 million labeled data pairs, in 15 hours; and the learned metric shows great effectiveness in properly measuring distances.