LGOct 13, 2020
Mixed data Deep Gaussian Mixture Model: A clustering model for mixed datasetsRobin Fuchs, Denys Pommeret, Cinzia Viroli
Clustering mixed data presents numerous challenges inherent to the very heterogeneous nature of the variables. A clustering algorithm should be able, despite of this heterogeneity, to extract discriminant pieces of information from the variables in order to design groups. In this work we introduce a multilayer architecture model-based clustering method called Mixed Deep Gaussian Mixture Model (MDGMM) that can be viewed as an automatic way to merge the clustering performed separately on continuous and non-continuous data. This architecture is flexible and can be adapted to mixed as well as to continuous or non-continuous data. In this sense we generalize Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data driven method that selects the best specification of the model and the optimal number of clusters for a given dataset "on the fly". Besides, our model provides continuous low-dimensional representations of the data which can be a useful tool to visualize mixed datasets. Finally, we validate the performance of our approach comparing its results with state-of-the-art mixed data clustering models over several commonly used datasets.
CLFeb 18, 2019
Classifying textual data: shallow, deep and ensemble methodsLaura Anderlucci, Lucia Guastadisegni, Cinzia Viroli
This paper focuses on a comparative evaluation of the most common and modern methods for text classification, including the recent deep learning strategies and ensemble methods. The study is motivated by a challenging real data problem, characterized by high-dimensional and extremely sparse data, deriving from incoming calls to the customer care of an Italian phone company. We will show that deep learning outperforms many classical (shallow) strategies but the combination of shallow and deep learning methods in a unique ensemble classifier may improve the robustness and the accuracy of "single" classification methods.
MLFeb 18, 2019
Deep Mixtures of Unigrams for uncovering Topics in Textual DataCinzia Viroli, Laura Anderlucci
Mixtures of Unigrams are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by Multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight on the grouping structure. In this work, we developed a deep version of mixtures of Unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for models with further deeper latent layers; the proposal is derived in a Bayesian framework. The behaviour of the Deep Mixtures of Unigrams is empirically compared with that of other traditional and state-of-the-art methods, namely $k$-means with cosine distance, $k$-means with Euclidean distance on data transformed according to Semantic Analysis, Partition Around Medoids, Mixture of Gaussians on semantic-based transformed data, hierarchical clustering according to Ward's method with cosine dissimilarity, Latent Dirichlet Allocation, Mixtures of Unigrams estimated via the EM algorithm, Spectral Clustering and Affinity Propagation clustering. The performance is evaluated in terms of both correct classification rate and Adjusted Rand Index. Simulation studies and real data analysis prove that going deep in clustering such data highly improves the classification accuracy.
MLNov 18, 2017
Deep Gaussian Mixture ModelsCinzia Viroli, Geoffrey J. McLachlan
Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers.