DSJul 3, 2022
An Empirical Evaluation of $k$-Means CoresetsChris Schwiegelshohn, Omar Ali Sheikh-Omar
Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high performance coresets for clustering problems such as $k$-means in both theory and practice. Curiously, there exists no work on comparing the quality of available $k$-means coresets. In this paper we perform such an evaluation. There currently is no algorithm known to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows us an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.
CGNov 15, 2022
Improved Coresets for Euclidean $k$-MeansVincent Cohen-Addad, Kasper Green Larsen, David Saulpic et al.
Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a $(1\pm \varepsilon)$ factor. The current state of the art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $Ω(k \varepsilon^{-2})$. In this paper, we improve the upper bounds $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$.
LGApr 23, 2020
Supervised Domain Adaptation: A Graph Embedding Perspective and a Rectified Experimental ProtocolLukas Hedegaard, Omar Ali Sheikh-Omar, Alexandros Iosifidis
Domain Adaptation is the process of alleviating distribution gaps between data from different domains. In this paper, we show that Domain Adaptation methods using pair-wise relationships between source and target domain data can be formulated as a Graph Embedding in which the domain labels are incorporated into the structure of the intrinsic and penalty graphs. Specifically, we analyse the loss functions of three existing state-of-the-art Supervised Domain Adaptation methods and demonstrate that they perform Graph Embedding. Moreover, we highlight some generalisation and reproducibility issues related to the experimental setup commonly used to demonstrate the few-shot learning capabilities of these methods. To assess and compare Supervised Domain Adaptation methods accurately, we propose a rectified evaluation protocol, and report updated benchmarks on the standard datasets Office31 (Amazon, DSLR, and Webcam), Digits (MNIST, USPS, SVHN, and MNIST-M) and VisDA (Synthetic, Real).
LGMar 9, 2020
Supervised Domain Adaptation using Graph EmbeddingLukas Hedegaard Morsing, Omar Ali Sheikh-Omar, Alexandros Iosifidis
Getting deep convolutional neural networks to perform well requires a large amount of training data. When the available labelled data is small, it is often beneficial to use transfer learning to leverage a related larger dataset (source) in order to improve the performance on the small dataset (target). Among the transfer learning approaches, domain adaptation methods assume that distributions between the two domains are shifted and attempt to realign them. In this paper, we consider the domain adaptation problem from the perspective of dimensionality reduction and propose a generic framework based on graph embedding. Instead of solving the generalised eigenvalue problem, we formulate the graph-preserving criterion as a loss in the neural network and learn a domain-invariant feature transformation in an end-to-end fashion. We show that the proposed approach leads to a powerful Domain Adaptation framework; a simple LDA-inspired instantiation of the framework leads to state-of-the-art performance on two of the most widely used Domain Adaptation benchmarks, Office31 and MNIST to USPS datasets.