Aristidis Likas

LG
h-index50
14papers
109citations
Novelty52%
AI Score50

14 Papers

43.9LGJun 3
UniFair: A unified fair clustering approach based on separation and compactness

Antonia Karra, Vasiliki Papanikou, Georgios Vardakas et al.

Clustering is increasingly used to support high-impact decisions, yet standard objectives such as $k$-means can produce clusterings that treat demographic groups unequally. Existing fair clustering methods typically optimize a single notion of fairness and often overlook how clustering costs interact with the geometry of the induced decision boundaries. We propose \textsc{UniFair}, a unified framework that jointly optimizes \emph{separation fairness} and \emph{social fairness}. Separation fairness encourages protected groups to lie farther from the induced decision boundaries, while social fairness reduces disparities in within-cluster distortion by penalizing group-wise clustering costs. We develop gradient-based optimization procedures for separation-fair and unified $k$-means objectives, and extend them to deep clustering by enforcing the same criteria in the latent space of an autoencoder. Experiments on tabular and image datasets show that \textsc{UniFair} reduces both boundary-related and cost-based group disparities with only a modest increase in clustering loss.

LGNov 22, 2022
Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm

Georgios Vardakas, Aristidis Likas

The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed. However, its main disadvantage is its high sensitivity to the initial positions of the cluster centers. The global $k$-means is a deterministic algorithm proposed to tackle the random initialization problem of k-means but its well-known that requires high computational cost. It partitions the data to $K$ clusters by solving all $k$-means sub-problems incrementally for all $k=1,\ldots, K$. For each $k$ cluster problem, the method executes the $k$-means algorithm $N$ times, where $N$ is the number of datapoints. In this paper, we propose the \emph{global $k$-means\texttt{++}} clustering algorithm, which is an effective way of acquiring quality clustering solutions akin to those of global $k$-means with a reduced computational load. This is achieved by exploiting the center selection probability that is effectively used in the $k$-means\texttt{++} algorithm. The proposed method has been tested and compared in various benchmark datasets yielding very satisfactory results in terms of clustering quality and execution speed.

21.4LGApr 15
Composite Silhouette: A Subsampling-based Aggregation Strategy

Aggelos Semoglou, Aristidis Likas, John Pavlopoulos

Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.

MENov 28, 2023
A Multivariate Unimodality Test Harnessing the Dip Statistic of Mahalanobis Distances Over Random Projections

Prodromos Kolyvakis, Aristidis Likas

Unimodality, pivotal in statistical analysis, offers insights into dataset structures and drives sophisticated analytical procedures. While unimodality's confirmation is straightforward for one-dimensional data using methods like Silverman's approach and Hartigans' dip statistic, its generalization to higher dimensions remains challenging. By extrapolating one-dimensional unimodality principles to multi-dimensional spaces through linear random projections and leveraging point-to-point distancing, our method, rooted in $α$-unimodality assumptions, presents a novel multivariate unimodality test named mud-pod. Both theoretical and empirical studies confirm the efficacy of our method in unimodality assessment of multidimensional datasets as well as in estimating the number of clusters.

LGFeb 1, 2024
Deep Clustering Using the Soft Silhouette Score: Towards Compact and Well-Separated Clusters

Georgios Vardakas, Ioannis Papakostas, Aristidis Likas

Unsupervised learning has gained prominence in the big data era, offering a means to extract valuable insights from unlabeled datasets. Deep clustering has emerged as an important unsupervised category, aiming to exploit the non-linear mapping capabilities of neural networks in order to enhance clustering performance. The majority of deep clustering literature focuses on minimizing the inner-cluster variability in some embedded space while keeping the learned representation consistent with the original high-dimensional dataset. In this work, we propose soft silhoutte, a probabilistic formulation of the silhouette coefficient. Soft silhouette rewards compact and distinctly separated clustering solutions like the conventional silhouette coefficient. When optimized within a deep clustering framework, soft silhouette guides the learned representations towards forming compact and well-separated clusters. In addition, we introduce an autoencoder-based deep learning architecture that is suitable for optimizing the soft silhouette objective function. The proposed deep clustering method has been tested and compared with several well-studied deep clustering methods on various benchmark datasets, yielding very satisfactory clustering results.

LGJan 11, 2024
Revisiting Silhouette Aggregation

John Pavlopoulos, Georgios Vardakas, Aristidis Likas

Silhouette coefficient is an established internal clustering evaluation measure that produces a score per data point, assessing the quality of its clustering assignment. To assess the quality of the clustering of the whole dataset, the scores of all the points in the dataset are typically (micro) averaged into a single value. An alternative path, however, that is rarely employed, is to average first at the cluster level and then (macro) average across clusters. As we illustrate in this work with a synthetic example, the typical micro-averaging strategy is sensitive to cluster imbalance while the overlooked macro-averaging strategy is far more robust. By investigating macro-Silhouette further, we find that uniform sub-sampling, the only available strategy in existing libraries, harms the measure's robustness against imbalance. We address this issue by proposing a per-cluster sampling method. An experimental study on eight real-world datasets is then used to analyse both coefficients in two clustering tasks.

LGDec 18, 2023
UniForCE: The Unimodality Forest Method for Clustering and Estimation of the Number of Clusters

Georgios Vardakas, Argyris Kalogeratos, Aristidis Likas

Estimating the number of clusters k while clustering the data is a challenging task. An incorrect cluster assumption indicates that the number of clusters k gets wrongly estimated. Consequently, the model fitting becomes less important. In this work, we focus on the concept of unimodality and propose a flexible cluster definition called locally unimodal cluster. A locally unimodal cluster extends for as long as unimodality is locally preserved across pairs of subclusters of the data. Then, we propose the UniForCE method for locally unimodal clustering. The method starts with an initial overclustering of the data and relies on the unimodality graph that connects subclusters forming unimodal pairs. Such pairs are identified using an appropriate statistical test. UniForCE identifies maximal locally unimodal clusters by computing a spanning forest in the unimodality graph. Experimental results on both real and synthetic datasets illustrate that the proposed methodology is particularly flexible and robust in discovering regular and highly complex cluster shapes. Most importantly, it automatically provides an adequate estimation of the number of clusters.

CLJun 18, 2025
TopClustRAG at SIGIR 2025 LiveRAG Challenge

Juli Bakagianni, John Pavlopoulos, Aristidis Likas

We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.

LGJun 15, 2025
Silhouette-Guided Instance-Weighted k-means

Aggelos Semoglou, Aristidis Likas, John Pavlopoulos

Clustering is a fundamental unsupervised learning task with numerous applications across diverse fields. Popular algorithms such as k-means often struggle with outliers or imbalances, leading to distorted centroids and suboptimal partitions. We introduce K-Sil, a silhouette-guided refinement of the k-means algorithm that weights points based on their silhouette scores, prioritizing well-clustered instances while suppressing borderline or noisy regions. The algorithm emphasizes user-specified silhouette aggregation metrics: macro-, micro-averaged or a combination, through self-tuning weighting schemes, supported by appropriate sampling strategies and scalable approximations. These components ensure computational efficiency and adaptability to diverse dataset geometries. Theoretical guarantees establish centroid convergence, and empirical validation on synthetic and real-world datasets demonstrates statistically significant improvements in silhouette scores over k-means and two other instance-weighted k-means variants. These results establish K-Sil as a principled alternative for applications demanding high-quality, well-separated clusters.

LGJan 17, 2025
Counterfactual Explanations for k-means and Gaussian Clustering

Georgios Vardakas, Antonia Karra, Evaggelia Pitoura et al.

Counterfactuals have been recognized as an effective approach to explain classifier decisions. Nevertheless, they have not yet been considered in the context of clustering. In this work, we propose the use of counterfactuals to explain clustering solutions. First, we present a general definition for counterfactuals for model-based clustering that includes plausibility and feasibility constraints. Then we consider the counterfactual generation problem for k-means and Gaussian clustering assuming Euclidean distance. Our approach takes as input the factual, the target cluster, a binary mask indicating actionable or immutable features and a plausibility factor specifying how far from the cluster boundary the counterfactual should be placed. In the k-means clustering case, analytical mathematical formulas are presented for computing the optimal solution, while in the Gaussian clustering case (assuming full, diagonal, or spherical covariances) our method requires the numerical solution of a nonlinear equation with a single parameter only. We demonstrate the advantages of our approach through illustrative examples and quantitative experimental comparisons.

LGDec 20, 2024
Statistical Modeling of Univariate Multimodal Data

Paraskevi Chasani, Aristidis Likas

Unimodality constitutes a key property indicating grouping behavior of the data around a single mode of its density. We propose a method that partitions univariate data into unimodal subsets through recursive splitting around valley points of the data density. For valley point detection, we introduce properties of critical points on the convex hull of the empirical cumulative density function (ecdf) plot that provide indications on the existence of density valleys. Next, we apply a unimodal data modeling approach that provides a statistical model for each obtained unimodal subset in the form of a Uniform Mixture Model (UMM). Consequently, a hierarchical statistical model of the initial dataset is obtained in the form of a mixture of UMMs, named as the Unimodal Mixture Model (UDMM). The proposed method is non-parametric, hyperparameter-free, automatically estimates the number of unimodal subsets and provides accurate statistical models as indicated by experimental results on clustering and density estimation tasks.

AIApr 30, 2024
Augmented neural forms with parametric boundary-matching operators for solving ordinary differential equations

Adam D. Kypriadis, Isaac E. Lagaris, Aristidis Likas et al.

Approximating solutions of ordinary and partial differential equations constitutes a significant challenge. Based on functional expressions that inherently depend on neural networks, neural forms are specifically designed to precisely satisfy the prescribed initial or boundary conditions of the problem, while providing the approximate solutions in closed form. Departing from the important class of ordinary differential equations, the present work aims to refine and validate the neural forms methodology, paving the ground for further developments in more challenging fields. The main contributions are as follows. First, it introduces a formalism for systematically crafting proper neural forms with adaptable boundary matches that are amenable to optimization. Second, it describes a novel technique for converting problems with Neumann or Robin conditions into equivalent problems with parametric Dirichlet conditions. Third, it outlines a method for determining an upper bound on the absolute deviation from the exact solution. The proposed augmented neural forms approach was tested on a set of diverse problems, encompassing first- and second-order ordinary differential equations, as well as first-order systems. Stiff differential equations have been considered as well. The resulting solutions were subjected to assessment against existing exact solutions, solutions derived through the common penalized neural method, and solutions obtained via contemporary numerical analysis methods. The reported results demonstrate that the augmented neural forms not only satisfy the boundary and initial conditions exactly, but also provide closed-form solutions that facilitate high-quality interpolation and controllable overall precision. These attributes are essential for expanding the application field of neural forms to more challenging problems that are described by partial differential equations.

LGAug 28, 2020
The UU-test for Statistical Modeling of Unimodal Data

Paraskevi Chasani, Aristidis Likas

Deciding on the unimodality of a dataset is an important problem in data analysis and statistical modeling. It allows to obtain knowledge about the structure of the dataset, ie. whether data points have been generated by a probability distribution with a single or more than one peaks. Such knowledge is very useful for several data analysis problems, such as for deciding on the number of clusters and determining unimodal projections. We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset. The method operates on the empirical cumulative density function (ecdf) of the dataset. It attempts to build a piecewise linear approximation of the ecdf that is unimodal and models the data sufficiently in the sense that the data corresponding to each linear segment follows the uniform distribution. A unique feature of this approach is that in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model. We present experimental results in order to assess the ability of the method to decide on unimodality and perform comparisons with the well-known dip-test approach. In addition, in the case of unimodal datasets we evaluate the Uniform Mixture Models provided by the proposed method using the test set log-likelihood and the two-sample Kolmogorov-Smirnov (KS) test.

NIDec 23, 2019
A Replication Strategy for Mobile Opportunistic Networks based on Utility Clustering

Evangelos Papapetrou, Aristidis Likas

Dynamic replication is a wide-spread multi-copy routing approach for efficiently coping with the intermittent connectivity in mobile opportunistic networks. According to it, a node forwards a message replica to an encountered node based on a utility value that captures the latter's fitness for delivering the message to the destination. The popularity of the approach stems from its flexibility to effectively operate in networks with diverse characteristics without requiring special customization. Nonetheless, its drawback is the tendency to produce a high number of replicas that consume limited resources such as energy and storage. To tackle the problem we make the observation that network nodes can be grouped, based on their utility values, into clusters that portray different delivery capabilities. We exploit this finding to transform the basic forwarding strategy, which is to move a packet using nodes of increasing utility, and actually forward it through clusters of increasing delivery capability. The new strategy works in synergy with the basic dynamic replication algorithms and is fully configurable, in the sense that it can be used with virtually any utility function. We also extend our approach to work with two utility functions at the same time, a feature that is especially efficient in mobile networks that exhibit social characteristics. By conducting experiments in a wide set of real-life networks, we empirically show that our method is robust in reducing the overall number of replicas in networks with diverse connectivity characteristics without at the same time hindering delivery efficiency.