Georgios Vardakas

LG
h-index50
6papers
86citations
Novelty53%
AI Score40

6 Papers

LGJun 3
UniFair: A unified fair clustering approach based on separation and compactness

Antonia Karra, Vasiliki Papanikou, Georgios Vardakas et al.

Clustering is increasingly used to support high-impact decisions, yet standard objectives such as $k$-means can produce clusterings that treat demographic groups unequally. Existing fair clustering methods typically optimize a single notion of fairness and often overlook how clustering costs interact with the geometry of the induced decision boundaries. We propose \textsc{UniFair}, a unified framework that jointly optimizes \emph{separation fairness} and \emph{social fairness}. Separation fairness encourages protected groups to lie farther from the induced decision boundaries, while social fairness reduces disparities in within-cluster distortion by penalizing group-wise clustering costs. We develop gradient-based optimization procedures for separation-fair and unified $k$-means objectives, and extend them to deep clustering by enforcing the same criteria in the latent space of an autoencoder. Experiments on tabular and image datasets show that \textsc{UniFair} reduces both boundary-related and cost-based group disparities with only a modest increase in clustering loss.

LGNov 22, 2022
Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm

Georgios Vardakas, Aristidis Likas

The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global $k$-means is a deterministic algorithm proposed to tackle the random initialization problem of k-means but its well-known that requires high computational cost. It partitions the data to $K$ clusters by solving all $k$-means sub-problems incrementally for all $k=1,\ldots, K$. For each $k$ cluster problem, the method executes the $k$-means algorithm $N$ times, where $N$ is the number of datapoints. In this paper, we propose the \emph{global $k$-means\texttt{++}} clustering algorithm, which is an effective way of acquiring quality clustering solutions akin to those of global $k$-means with a reduced computational load. This is achieved by exploiting the center selection probability that is effectively used in the $k$-means\texttt{++} algorithm. The proposed method has been tested and compared in various benchmark datasets yielding very satisfactory results in terms of clustering quality and execution speed.

LGFeb 1, 2024
Deep Clustering Using the Soft Silhouette Score: Towards Compact and Well-Separated Clusters

Georgios Vardakas, Ioannis Papakostas, Aristidis Likas

Unsupervised learning has gained prominence in the big data era, offering a means to extract valuable insights from unlabeled datasets. Deep clustering has emerged as an important unsupervised category, aiming to exploit the non-linear mapping capabilities of neural networks in order to enhance clustering performance. The majority of deep clustering literature focuses on minimizing the inner-cluster variability in some embedded space while keeping the learned representation consistent with the original high-dimensional dataset. In this work, we propose soft silhoutte, a probabilistic formulation of the silhouette coefficient. Soft silhouette rewards compact and distinctly separated clustering solutions like the conventional silhouette coefficient. When optimized within a deep clustering framework, soft silhouette guides the learned representations towards forming compact and well-separated clusters. In addition, we introduce an autoencoder-based deep learning architecture that is suitable for optimizing the soft silhouette objective function. The proposed deep clustering method has been tested and compared with several well-studied deep clustering methods on various benchmark datasets, yielding very satisfactory clustering results.

LGJan 11, 2024
Revisiting Silhouette Aggregation

John Pavlopoulos, Georgios Vardakas, Aristidis Likas

Silhouette coefficient is an established internal clustering evaluation measure that produces a score per data point, assessing the quality of its clustering assignment. To assess the quality of the clustering of the whole dataset, the scores of all the points in the dataset are typically (micro) averaged into a single value. An alternative path, however, that is rarely employed, is to average first at the cluster level and then (macro) average across clusters. As we illustrate in this work with a synthetic example, the typical micro-averaging strategy is sensitive to cluster imbalance while the overlooked macro-averaging strategy is far more robust. By investigating macro-Silhouette further, we find that uniform sub-sampling, the only available strategy in existing libraries, harms the measure's robustness against imbalance. We address this issue by proposing a per-cluster sampling method. An experimental study on eight real-world datasets is then used to analyse both coefficients in two clustering tasks.

LGDec 18, 2023
UniForCE: The Unimodality Forest Method for Clustering and Estimation of the Number of Clusters

Georgios Vardakas, Argyris Kalogeratos, Aristidis Likas

Estimating the number of clusters k while clustering the data is a challenging task. An incorrect cluster assumption indicates that the number of clusters k gets wrongly estimated. Consequently, the model fitting becomes less important. In this work, we focus on the concept of unimodality and propose a flexible cluster definition called locally unimodal cluster. A locally unimodal cluster extends for as long as unimodality is locally preserved across pairs of subclusters of the data. Then, we propose the UniForCE method for locally unimodal clustering. The method starts with an initial overclustering of the data and relies on the unimodality graph that connects subclusters forming unimodal pairs. Such pairs are identified using an appropriate statistical test. UniForCE identifies maximal locally unimodal clusters by computing a spanning forest in the unimodality graph. Experimental results on both real and synthetic datasets illustrate that the proposed methodology is particularly flexible and robust in discovering regular and highly complex cluster shapes. Most importantly, it automatically provides an adequate estimation of the number of clusters.

LGJan 17, 2025
Counterfactual Explanations for k-means and Gaussian Clustering

Georgios Vardakas, Antonia Karra, Evaggelia Pitoura et al.

Counterfactuals have been recognized as an effective approach to explain classifier decisions. Nevertheless, they have not yet been considered in the context of clustering. In this work, we propose the use of counterfactuals to explain clustering solutions. First, we present a general definition for counterfactuals for model-based clustering that includes plausibility and feasibility constraints. Then we consider the counterfactual generation problem for k-means and Gaussian clustering assuming Euclidean distance. Our approach takes as input the factual, the target cluster, a binary mask indicating actionable or immutable features and a plausibility factor specifying how far from the cluster boundary the counterfactual should be placed. In the k-means clustering case, analytical mathematical formulas are presented for computing the optimal solution, while in the Gaussian clustering case (assuming full, diagonal, or spherical covariances) our method requires the numerical solution of a nonlinear equation with a single parameter only. We demonstrate the advantages of our approach through illustrative examples and quantitative experimental comparisons.