Paul D. McNicholas

ME
h-index45
29papers
589citations
Novelty39%
AI Score46

29 Papers

MLOct 8, 2023
Clustering Three-Way Data with Outliers

Katharine M. Clark, Paul D. McNicholas

Matrix-variate distributions are a recent addition to the model-based clustering field, thereby making it possible to analyze data in matrix form with complex structure such as images and time series. Due to its recent appearance, there is limited literature on matrix-variate data, with even less on dealing with outliers in these models. An approach for clustering matrix-variate normal data with outliers is discussed. The approach, which uses the distribution of subset log-likelihoods, extends the OCLUST algorithm to matrix-variate normal data and uses an iterative approach to detect and trim outliers.

MLMay 8
Classification Fields: Arbitrarily Fine Recursive Hierarchical Clustering From Few Examples

Yicen Li, Ruiyang Hong, Anastasis Kratsios et al.

Classical clustering methods usually return either a finite partition of the observed data or a finite dendrogram over it. This finite-sample view is inadequate when the hierarchy of interest is a recursive geometric object with fine-scale refinements that continue beyond the levels directly observed. We introduce classification fields: infinite-depth hierarchical cluster structures on $\mathbb{R}^d$ generated by a local parent-to-child refinement rule. A classification field generator maps each parent centre to an ordered, bounded, and separated tuple of child residuals. Together with a root and a scale factor, this rule recursively generates cluster centres, Voronoi cells, and a metric DAG encoding the hierarchy. Given only a finite prefix of such a hierarchy, we learn a classification field predictor that approximates the generator and can be rolled out to unseen depths. We prove exponential truncation convergence in the completed cell metric and ReLU realizability with width $O(\varepsilon^{-γ})$ and depth $\widetilde O(\varepsilon^{-3γ/2})$, where $γ=\log K/(-\log s)$, up to finite-window aspect-ratio factors. The approximation holds at the level of the induced compact metric structures, measured in the completed cell-metric Hausdorff distance. Experimental validation on matched CFG-generated hierarchies, IFS fractals, and image-induced recursive clustering hierarchies shows that learned predictors preserve ordered child slots, unordered geometry, and hierarchy-level path metrics under recursive rollout. These results support the claim that finite hierarchical observations can reveal local refinement rules capable of generating substantially deeper classification fields.

MLApr 25
Turtle shell clustering: A mixture approach to discriminative clustering with applications to flow cytometry and other data

Mackenzie R. Neal, Paul D. McNicholas, Arthur White

Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method -- a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes -- is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.

MEMay 14, 2025
Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios

Siyi Wang, Alexandre Leblanc, Paul D. McNicholas

Cluster analysis, or clustering, plays a crucial role across numerous scientific and engineering domains. Despite the wealth of clustering methods proposed over the past decades, each method is typically designed for specific scenarios and presents certain limitations in practical applications. In this paper, we propose depth-based local center clustering (DLCC). This novel method makes use of data depth, which is known to produce a center-outward ordering of sample points in a multivariate space. However, data depth typically fails to capture the multimodal characteristics of {data}, something of the utmost importance in the context of clustering. To overcome this, DLCC makes use of a local version of data depth that is based on subsets of {data}. From this, local centers can be identified as well as clusters of varying shapes. Furthermore, we propose a new internal metric based on density-based clustering to evaluate clustering performance on {non-convex clusters}. Overall, DLCC is a flexible clustering approach that seems to overcome some limitations of traditional clustering methods, thereby enhancing data analysis capabilities across a wide range of application scenarios.

MLJul 31, 2025
funOCLUST: Clustering Functional Data with Outliers

Katharine M. Clark, Paul D. McNicholas

Functional data present unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers. An extension of the OCLUST algorithm to the functional setting is proposed to address these issues. The approach leverages the OCLUST framework, creating a robust method to cluster curves and trim outliers. The methodology is evaluated on both simulated and real-world functional datasets, demonstrating strong performance in clustering and outlier identification.

CVFeb 6, 2025
Keep It Light! Simplifying Image Clustering Via Text-Free Adapters

Yicen Li, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios et al. · eth-zurich

Many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders, and text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, competitive performance with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs, demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.

MEJul 19, 2019
Clustering Higher Order Data: An Application to Pediatric Multi-variable Longitudinal Data

Peter A. Tait, Paul D. McNicholas, Joyce Obeid

Physical activity levels are an important predictor of cardiovascular health and increasingly being measured by sensors, like accelerometers. Accelerometers produce rich multivariate data that can inform important clinical decisions related to individual patients and public health. The CHAMPION study, a study of youth with chronic inflammatory conditions, aims to determine the links between heart health, inflammation, physical activity, and fitness. The accelerometer data from CHAMPION is represented as 4-dimensional arrays, and a finite mixture of multidimensional arrays model is developed for clustering. The use of model-based clustering for multidimensional arrays has thus far been limited to two-dimensional arrays, i.e., matrices or order-two tensors, and the work in this paper can also be seen as an approach for clustering D-dimensional arrays for D > 2 or, in other words, for clustering order-D tensors.

MEJul 2, 2019
Finding Outliers in Gaussian Model-Based Clustering

Katharine M. Clark, Paul D. McNicholas

Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.

MEMar 12, 2019
Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions

Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang et al.

Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.

MEJan 26, 2019
Clustering Discrete-Valued Time Series

Tyler Roick, Dimitris Karlis, Paul D. McNicholas

There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model, several existing techniques such as the selection of the number of clusters, estimation using expectation-maximization and model selection are applicable. The proposed model is then demonstrated on real data to illustrate its clustering applications.

APDec 23, 2018
Detecting British Columbia Coastal Rainfall Patterns by Clustering Gaussian Processes

Forrest Paton, Paul D. McNicholas

Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. Gaussian processes are a generalization of the multivariate normal distribution to function space and, in this paper, they are used to shed light on coastal rainfall patterns in British Columbia (BC). Specifically, this work addressed the question over how one should carry out an exploratory cluster analysis for the BC, or any similar, coastal rainfall data. An approach is developed for clustering multiple processes observed on a comparable interval, based on how similar their underlying covariance kernel is. This approach provides interesting insights into the BC data, and these insights can be framed in terms of El Niño and La Niña; however, the result is not simply one cluster representing El Niño years and another for La Niña years. From one perspective, the results show that clustering annual rainfall can potentially be used to identify extreme weather patterns.

COOct 31, 2018
An Evolutionary Algorithm with Crossover and Mutation for Model-Based Clustering

Sharon M. McNicholas, Paul D. McNicholas, Daniel A. Ashlock

An evolutionary algorithm (EA) is developed as an alternative to the EM algorithm for parameter estimation in model-based clustering. This EA facilitates a different search of the fitness landscape, i.e., the likelihood surface, utilizing both crossover and mutation. Furthermore, this EA represents an efficient approach to "hard" model-based clustering and so it can be viewed as a sort of generalization of the k-means algorithm, which is itself equivalent to a restricted Gaussian mixture model. The EA is illustrated on several datasets, and its performance is compared to other hard clustering approaches and model-based clustering via the EM algorithm.

MESep 7, 2018
Mixtures of Skewed Matrix Variate Bilinear Factor Analyzers

Michael P. B. Gallaugher, Paul D. McNicholas

In recent years, data have become increasingly higher dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix variate normality, which is not always sensible if cluster skewness or excess kurtosis is present. Mixtures of bilinear factor analyzers using skewed matrix variate distributions are proposed. In all, four such mixture models are presented, based on matrix variate skew-t, generalized hyperbolic, variance-gamma, and normal inverse Gaussian distributions, respectively.

MEApr 13, 2018
A Latent Gaussian Mixture Model for Clustering Longitudinal Data

Vanessa S. E. Bierling, Paul D. McNicholas

Finite mixture models have become a popular tool for clustering. Amongst other uses, they have been applied for clustering longitudinal data and clustering high-dimensional data. In the latter case, a latent Gaussian mixture model is sometimes used. Although there has been much work on clustering using latent variables and on clustering longitudinal data, respectively, there has been a paucity of work that combines these features. An approach is developed for clustering longitudinal data with many time points based on an extension of the mixture of common factor analyzers model. A variation of the expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. The approach is illustrated using real and simulated data.

MEFeb 13, 2018
Clustering and Semi-Supervised Classification for Clickstream Data via Mixture Models

Michael P. B. Gallaugher, Paul D. McNicholas

Finite mixture models have been used for unsupervised learning for some time, and their use within the semi-supervised paradigm is becoming more commonplace. Clickstream data is one of the various emerging data types that demands particular attention because there is a notable paucity of statistical learning approaches currently available. A mixture of first-order continuous time Markov models is introduced for unsupervised and semi-supervised learning of clickstream data. This approach assumes continuous time, which distinguishes it from existing mixture model-based approaches; practically, this allows account to be taken of the amount of time each user spends on each webpage. The approach is evaluated, and compared to the discrete time approach, using simulated and real data.

MEDec 22, 2017
A Mixture of Matrix Variate Bilinear Factor Analyzers

Michael P. B. Gallaugher, Paul D. McNicholas

Over the years data has become increasingly higher dimensional, which has prompted an increased need for dimension reduction techniques. This is perhaps especially true for clustering (unsupervised classification) as well as semi-supervised and supervised classification. Although dimension reduction in the area of clustering for multivariate data has been quite thoroughly discussed within the literature, there is relatively little work in the area of three-way, or matrix variate, data. Herein, we develop a mixture of matrix variate bilinear factor analyzers (MMVBFA) model for use in clustering high-dimensional matrix variate data. This work can be considered both the first matrix variate bilinear factor analysis model as well as the first MMVBFA model. Parameter estimation is discussed, and the MMVBFA model is illustrated using simulated and real data.

CONov 3, 2014
Multivariate response and parsimony for Gaussian cluster-weighted models

Utkarsh J. Dang, Antonio Punzo, Paul D. McNicholas et al.

A family of parsimonious Gaussian cluster-weighted models is presented. This family concerns a multivariate extension to cluster-weighted modelling that can account for correlations between multivariate responses. Parsimony is attained by constraining parts of an eigen-decomposition imposed on the component covariance matrices. A sufficient condition for identifiability is provided and an expectation-maximization algorithm is presented for parameter estimation. Model performance is investigated on both synthetic and classical real data sets and compared with some popular approaches. Finally, accounting for linear dependencies in the presence of a linear regression structure is shown to offer better performance, vis-à-vis clustering, over existing methodologies.

MEApr 11, 2014
Model Based Clustering of High-Dimensional Binary Data

Yang Tang, Ryan P. Browne, Paul D. McNicholas

We propose a mixture of latent trait models with common slope parameters (MCLT) for model-based clustering of high-dimensional binary data, a data type for which few established methods exist. Recent work on clustering of binary data, based on a $d$-dimensional Gaussian latent variable, is extended by incorporating common factor analyzers. Accordingly, our approach facilitates a low-dimensional visual representation of the clusters. We extend the model further by the incorporation of random block effects. The dependencies in each block are taken into account through block-specific parameters that are considered to be random variables. A variational approximation to the likelihood is exploited to derive a fast algorithm for determining the model parameters. Our approach is demonstrated on real and simulated data.

MEFeb 26, 2014
Asymmetric Clusters and Outliers: Mixtures of Multivariate Contaminated Shifted Asymmetric Laplace Distributions

Katherine Morris, Antonio Punzo, Paul D. McNicholas et al.

Mixtures of multivariate contaminated shifted asymmetric Laplace distributions are developed for handling asymmetric clusters in the presence of outliers (also referred to as bad points herein). In addition to the parameters of the related non-contaminated mixture, for each (asymmetric) cluster, our model has one parameter controlling the proportion of outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding a flexibility to our approach that is absent from other approaches such as trimming. Moreover, each observation is given a posterior probability of belonging to a particular cluster, and of being an outlier or not; advantageously, this allows for the automatic detection of outliers. An expectation-conditional maximization algorithm is outlined for parameter estimation and various implementation issues are discussed. The behaviour of the proposed model is investigated, and compared with well-established finite mixtures, on artificial and real data.

MEDec 2, 2013
Families of Parsimonious Finite Mixtures of Regression Models

Utkarsh J. Dang, Paul D. McNicholas

Finite mixtures of regression models offer a flexible framework for investigating heterogeneity in data with functional dependencies. These models can be conveniently used for unsupervised learning on data with clear regression relationships. We extend such models by imposing an eigen-decomposition on the multivariate error covariance matrix. By constraining parts of this decomposition, we obtain families of parsimonious mixtures of regressions and mixtures of regressions with concomitant variables. These families of models account for correlations between multiple responses. An expectation-maximization algorithm is presented for parameter estimation and performance is illustrated on simulated and real data.

MENov 26, 2013
A Mixture of Generalized Hyperbolic Factor Analyzers

Cristina Tortora, Paul D. McNicholas, Ryan P. Browne

Model-based clustering imposes a finite mixture modelling structure on data for clustering. Finite mixture models assume that the population is a convex combination of a finite number of densities, the distribution within each population is a basic assumption of each particular model. Among all distributions that have been tried, the generalized hyperbolic distribution has the advantage that is a generalization of several other methods, such as the Gaussian distribution, the skew t-distribution, etc. With specific parameters, it can represent either a symmetric or a skewed distribution. While its inherent flexibility is an advantage in many ways, it means the estimation of more parameters than its special and limiting cases. The aim of this work is to propose a mixture of generalized hyperbolic factor analyzers to introduce parsimony and extend the method to high dimensional data. This work can be seen as an extension of the mixture of factor analyzers model to generalized hyperbolic mixtures. The performance of our generalized hyperbolic factor analyzers is illustrated on real data, where it performs favourably compared to its Gaussian analogue.

MENov 1, 2013
Parsimonious Shifted Asymmetric Laplace Mixtures

Brian C. Franczak, Paul D. McNicholas, Ryan P. Browne et al.

A family of parsimonious shifted asymmetric Laplace mixture models is introduced. We extend the mixture of factor analyzers model to the shifted asymmetric Laplace distribution. Imposing constraints on the constitute parts of the resulting decomposed component scale matrices leads to a family of parsimonious models. An explicit two-stage parameter estimation procedure is described, and the Bayesian information criterion and the integrated completed likelihood are compared for model selection. This novel family of models is applied to real data, where it is compared to its Gaussian analogue within clustering and classification paradigms.

MESep 7, 2013
Variational Bayes Approximations for Clustering via Mixtures of Normal Inverse Gaussian Distributions

Sanjeena Subedi, Paul D. McNicholas

Parameter estimation for model-based clustering using a finite mixture of normal inverse Gaussian (NIG) distributions is achieved through variational Bayes approximations. Univariate NIG mixtures and multivariate NIG mixtures are considered. The use of variational Bayes approximations here is a substantial departure from the traditional EM approach and alleviates some of the associated computational complexities and uncertainties. Our variational algorithm is applied to simulated and real data. The paper concludes with discussion and suggestions for future work.

MEAug 28, 2013
Clustering, Classification, Discriminant Analysis, and Dimension Reduction via Generalized Hyperbolic Mixtures

Katherine Morris, Paul D. McNicholas

A method for dimension reduction with clustering, classification, or discriminant analysis is introduced. This mixture model-based approach is based on fitting generalized hyperbolic mixtures on a reduced subspace within the paradigm of model-based clustering, classification, or discriminant analysis. A reduced subspace of the data is derived by considering the extent to which group means and group covariances vary. The members of the subspace arise through linear combinations of the original data, and are ordered by importance via the associated eigenvalues. The observations can be projected onto the subspace, resulting in a set of variables that captures most of the clustering information available. The use of generalized hyperbolic mixtures gives a robust framework capable of dealing with skewed clusters. Although dimension reduction is increasingly in demand across many application areas, the authors are most familiar with biological applications and so two of the five real data examples are within that sphere. Simulated data are also used for illustration. The approach introduced herein can be considered the most general such approach available, and so we compare results to three special and limiting cases. Comparisons with several well established techniques illustrate its promising performance.

APAug 16, 2013
Standardizing Interestingness Measures for Association Rules

Mateen Shaikh, Paul D. McNicholas, M. Luiza Antonie et al.

Interestingness measures provide information that can be used to prune or select association rules. A given value of an interestingness measure is often interpreted relative to the overall range of the values that the interestingness measure can take. However, properties of individual association rules restrict the values an interestingness measure can achieve. An interesting measure can be standardized to take this into account, but this has only been done for one interestingness measure to date, i.e., the lift. Standardization provides greater insight than the raw value and may even alter researchers' perception of the data. We derive standardized analogues of three interestingness measures and use real and simulated data to compare them to their raw versions, each other, and the standardized lift.

MEJul 21, 2013
Mixtures of Common Skew-t Factor Analyzers

Paula M. Murray, Paul D. McNicholas, Ryan P. Browne

A mixture of common skew-t factor analyzers model is introduced for model-based clustering of high-dimensional data. By assuming common component factor loadings, this model allows clustering to be performed in the presence of a large number of mixture components or when the number of dimensions is too large to be well-modelled by the mixtures of factor analyzers model or a variant thereof. Furthermore, assuming that the component densities follow a skew-t distribution allows robust clustering of skewed data. The alternating expectation-conditional maximization algorithm is employed for parameter estimation. We demonstrate excellent clustering performance when our model is applied to real and simulated data.This paper marks the first time that skewed common factors have been used.

MEJul 13, 2013
Fractionally-Supervised Classification

Irene Vrbik, Paul D. McNicholas

Traditionally, there are three species of classification: unsupervised, supervised, and semi-supervised. Supervised and semi-supervised classification differ by whether or not weight is given to unlabelled observations in the classification procedure. In unsupervised classification, or clustering, all observations are unlabeled and hence full weight is given to unlabelled observations. When some observations are unlabelled, it can be very difficult to \textit{a~priori} choose the optimal level of supervision, and the consequences of a sub-optimal choice can be non-trivial. A flexible fractionally-supervised approach to classification is introduced, where any level of supervision --- ranging from unsupervised to supervised --- can be attained. Our approach uses a weighted likelihood, wherein weights control the relative role that labelled and unlabelled data have in building a classifier. A comparison between our approach and the traditional species is presented using simulated and real data. Gaussian mixture models are used as a vehicle to illustrate our fractionally-supervised classification approach; however, it is broadly applicable and variations on the postulated model can be easily made.

MEJun 23, 2013
A Variational Approximations-DIC Rubric for Parameter Estimation and Mixture Model Selection Within a Family Setting

Sanjeena Subedi, Paul D. McNicholas

Mixture model-based clustering has become an increasingly popular data analysis technique since its introduction over fifty years ago, and is now commonly utilized within a family setting. Families of mixture models arise when the component parameters, usually the component covariance (or scale) matrices, are decomposed and a number of constraints are imposed. Within the family setting, model selection involves choosing the member of the family, i.e., the appropriate covariance structure, in addition to the number of mixture components. To date, the Bayesian information criterion (BIC) has proved most effective for model selection, and the expectation-maximization (EM) algorithm is usually used for parameter estimation. In fact, this EM-BIC rubric has virtually monopolized the literature on families of mixture models. Deviating from this rubric, variational Bayes approximations are developed for parameter estimation and the deviance information criterion for model selection. The variational Bayes approach provides an alternate framework for parameter estimation by constructing a tight lower bound on the complex marginal likelihood and maximizing this lower bound by minimizing the associated Kullback-Leibler divergence. This approach is taken on the most commonly used family of Gaussian mixture models, and real and simulated data are used to compare the new approach to the EM-BIC rubric.

MEJul 6, 2012
Mixtures of Shifted Asymmetric Laplace Distributions

Brian C. Franczak, Ryan P. Browne, Paul D. McNicholas

A mixture of shifted asymmetric Laplace distributions is introduced and used for clustering and classification. A variant of the EM algorithm is developed for parameter estimation by exploiting the relationship with the general inverse Gaussian distribution. This approach is mathematically elegant and relatively computationally straightforward. Our novel mixture modelling approach is demonstrated on both simulated and real data to illustrate clustering and classification applications. In these analyses, our mixture of shifted asymmetric Laplace distributions performs favourably when compared to the popular Gaussian approach. This work, which marks an important step in the non-Gaussian model-based clustering and classification direction, concludes with discussion as well as suggestions for future work.