Pietro Michiardi

LG
h-index26
35papers
676citations
Novelty53%
AI Score50

35 Papers

LGApr 4, 2022
Do Deep Neural Networks Contribute to Multivariate Time Series Anomaly Detection?

Julien Audibert, Pietro Michiardi, Frédéric Guyard et al.

Anomaly detection in time series is a complex task that has been widely studied. In recent years, the ability of unsupervised anomaly detection algorithms has received much attention. This trend has led researchers to compare only learning-based methods in their articles, abandoning some more conventional approaches. As a result, the community in this field has been encouraged to propose increasingly complex learning-based models mainly based on deep neural networks. To our knowledge, there are no comparative studies between conventional, machine learning-based and, deep neural network methods for the detection of anomalies in multivariate time series. In this work, we study the anomaly detection performance of sixteen conventional, machine learning-based and, deep neural network approaches on five real-world open datasets. By analyzing and comparing the performance of each of the sixteen methods, we show that no family of methods outperforms the others. Therefore, we encourage the community to reincorporate the three categories of methods in the anomaly detection in multivariate time series benchmarks.

MLJun 10, 2022
How Much is Enough? A Study on Diffusion Times in Score-based Generative Models

Giulio Franzese, Simone Rossi, Lixuan Yang et al.

Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, an analytical understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off, and suggest a new method to improve quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive w.r.t. the state-of-the-art, according to standard sample quality metrics and log-likelihood.

LGMar 1, 2023
Continuous-Time Functional Diffusion Processes

Giulio Franzese, Giulio Corallo, Simone Rossi et al.

We introduce Functional Diffusion Processes (FDPs), which generalize score-based diffusion models to infinite-dimensional function spaces. FDPs require a new mathematical framework to describe the forward and backward dynamics, and several extensions to derive practical training objectives. These include infinite-dimensional versions of Girsanov theorem, in order to be able to compute an ELBO, and of the sampling theorem, in order to guarantee that functional evaluations in a countable set of points are equivalent to infinite-dimensional functions. We use FDPs to build a new breed of generative models in function spaces, which do not require specialized network architectures, and that can work with any kind of continuous data. Our results on real data show that FDPs achieve high-quality image generation, using a simple MLP architecture with orders of magnitude fewer parameters than existing diffusion models.

LGJan 7, 2023
"It's a Match!" -- A Benchmark of Task Affinity Scores for Joint Learning

Raphael Azorin, Massimo Gallo, Alessandro Finamore et al.

While the promises of Multi-Task Learning (MTL) are attractive, characterizing the conditions of its success is still an open problem in Deep Learning. Some tasks may benefit from being learned together while others may be detrimental to one another. From a task perspective, grouping cooperative tasks while separating competing tasks is paramount to reap the benefits of MTL, i.e., reducing training and inference costs. Therefore, estimating task affinity for joint learning is a key endeavor. Recent work suggests that the training conditions themselves have a significant impact on the outcomes of MTL. Yet, the literature is lacking of a benchmark to assess the effectiveness of tasks affinity estimation techniques and their relation with actual MTL performance. In this paper, we take a first step in recovering this gap by (i) defining a set of affinity scores by both revisiting contributions from previous literature as well presenting new ones and (ii) benchmarking them on the Taskonomy dataset. Our empirical campaign reveals how, even in a small-scale scenario, task affinity scoring does not correlate well with actual MTL performance. Yet, some metrics can be more indicative than others.

LGApr 13, 2022
Safer Autonomous Driving in a Stochastic, Partially-Observable Environment by Hierarchical Contingency Planning

Ugo Lecerf, Christelle Yemdji-Tchassi, Pietro Michiardi

When learning to act in a stochastic, partially observable environment, an intelligent agent should be prepared to anticipate a change in its belief of the environment state, and be capable of adapting its actions on-the-fly to changing conditions. As humans, we are able to form contingency plans when learning a task with the explicit aim of being able to correct errors in the initial control, and hence prove useful if ever there is a sudden change in our perception of the environment which requires immediate corrective action. This is especially the case for autonomous vehicles (AVs) navigating real-world situations where safety is paramount, and a strong ability to react to a changing belief about the environment is truly needed. In this paper we explore an end-to-end approach, from training to execution, for learning robust contingency plans and combining them with a hierarchical planner to obtain a robust agent policy in an autonomous navigation task where other vehicles' behaviours are unknown, and the agent's belief about these behaviours is subject to sudden, last-second change. We show that our approach results in robust, safe behaviour in a partially observable, stochastic environment, generalizing well over environment dynamics not seen during training.

LGApr 11, 2022
Automatically Learning Fallback Strategies with Model-Free Reinforcement Learning in Safety-Critical Driving Scenarios

Ugo Lecerf, Christelle Yemdji-Tchassi, Sébastien Aubert et al.

When learning to behave in a stochastic environment where safety is critical, such as driving a vehicle in traffic, it is natural for human drivers to plan fallback strategies as a backup to use if ever there is an unexpected change in the environment. Knowing to expect the unexpected, and planning for such outcomes, increases our capability for being robust to unseen scenarios and may help prevent catastrophic failures. Control of Autonomous Vehicles (AVs) has a particular interest in knowing when and how to use fallback strategies in the interest of safety. Due to imperfect information available to an AV about its environment, it is important to have alternate strategies at the ready which might not have been deduced from the original training data distribution. In this paper we present a principled approach for a model-free Reinforcement Learning (RL) agent to capture multiple modes of behaviour in an environment. We introduce an extra pseudo-reward term to the reward model, to encourage exploration to areas of state-space different from areas privileged by the optimal policy. We base this reward term on a distance metric between the trajectories of agents, in order to force policies to focus on different areas of state-space than the initial exploring agent. Throughout the paper, we refer to this particular training paradigm as learning fallback strategies. We apply this method to an autonomous driving scenario, and show that we are able to learn useful policies that would have otherwise been missed out on during training, and unavailable to use when executing the control algorithm.

LGOct 21, 2023
Toward Generative Data Augmentation for Traffic Classification

Chao Wang, Alessandro Finamore, Pietro Michiardi et al.

Data Augmentation (DA)-augmenting training data with synthetic samples-is wildly adopted in Computer Vision (CV) to improve models performance. Conversely, DA has not been yet popularized in networking use cases, including Traffic Classification (TC). In this work, we present a preliminary study of 14 hand-crafted DAs applied on the MIRAGE19 dataset. Our results (i) show that DA can reap benefits previously unexplored in TC and (ii) foster a research agenda on the use of generative models to automate DA design.

LGOct 13, 2023
MINDE: Mutual Information Neural Diffusion Estimation

Giulio Franzese, Mustapha Bounoua, Pietro Michiardi

In this work we present a new method for the estimation of Mutual Information (MI) between random variables. Our approach is based on an original interpretation of the Girsanov theorem, which allows us to use score-based diffusion models to estimate the Kullback Leibler divergence between two densities as a difference between their score functions. As a by-product, our method also enables the estimation of the entropy of random variables. Armed with such building blocks, we present a general recipe to measure MI, which unfolds in two directions: one uses conditional diffusion process, whereas the other uses joint diffusion processes that allow simultaneous modelling of two random variables. Our results, which derive from a thorough experimental protocol over all the variants of our approach, indicate that our method is more accurate than the main alternatives from the literature, especially for challenging distributions. Furthermore, our methods pass MI self-consistency tests, including data processing and additivity under independence, which instead are a pain-point of existing methods.

LGJun 7, 2023
Multi-modal Latent Diffusion

Mustapha Bounoua, Giulio Franzese, Pietro Michiardi

Multi-modal data-sets are ubiquitous in modern applications, and multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities. However, existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities, and vice versa. We discuss the limitations underlying the unsatisfactory performance of existing methods, to motivate the need for a different approach. We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders. Individual latent variables are concatenated into a common latent space, which is fed to a masked diffusion model to enable generative modeling. We also introduce a new multi-time training method to learn the conditional score network for multi-modal diffusion. Our methodology substantially outperforms competitors in both generation quality and coherence, as shown through an extensive experimental campaign.

LGFeb 6
Improved Sampling Schedules for Discrete Diffusion Models

Alberto Foresti, Mustapha Bounoua, Giulio Franzese et al.

Discrete diffusion models have emerged as a powerful paradigm for generative modeling on sequence data; however, the information-theoretic principles governing their reverse processes remain significantly less understood than those of their continuous counterparts. In this work, we bridge this gap by analyzing the reverse process dynamics through the lens of thermodynamic entropy production. We propose the entropy production rate as a rigorous proxy for quantifying information generation, deriving as a byproduct a bound on the Wasserstein distance between intermediate states and the data distribution. Leveraging these insights, we introduce two novel sampling schedules that are uniformly spaced with respect to their corresponding physics-inspired metrics: the Entropic Discrete Schedule (EDS), which is defined by maintaining a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), which is defined by taking equal steps in terms of the Wasserstein distance. We empirically demonstrate that our proposed schedules significantly outperform state-of-the-art strategies across diverse application domains, including synthetic data, music notation, vision and language modeling, consistently achieving superior performance at a lower computational budget.

LGJun 21, 2024Code
Fine-grained Attention in Hierarchical Transformers for Tabular Time-series

Raphael Azorin, Zied Ben Houidi, Massimo Gallo et al.

Tabular data is ubiquitous in many real-life systems. In particular, time-dependent tabular data, where rows are chronologically related, is typically used for recording historical events, e.g., financial transactions, healthcare records, or stock history. Recently, hierarchical variants of the attention mechanism of transformer architectures have been used to model tabular time-series data. At first, rows (or columns) are encoded separately by computing attention between their fields. Subsequently, encoded rows (or columns) are attended to one another to model the entire tabular time-series. While efficient, this approach constrains the attention granularity and limits its ability to learn patterns at the field-level across separate rows, or columns. We take a first step to address this gap by proposing Fieldy, a fine-grained hierarchical model that contextualizes fields at both the row and column levels. We compare our proposal against state of the art models on regression and classification tasks using public tabular time-series datasets. Our results show that combining row-wise and column-wise attention improves performance without increasing model size. Code and data are available at https://github.com/raphaaal/fieldy.

67.7LGApr 22
Relative Entropy Estimation in Function Space: Theory and Applications to Trajectory Inference

Chao Wang, Luca Nepote, Giulio Franzese et al.

Trajectory Inference (TI) seeks to recover latent dynamical processes from snapshot data, where only independent samples from time-indexed marginals are observed. In applications such as single-cell genomics, destructive measurements make path-space laws non-identifiable from finitely many marginals, leaving held-out marginal prediction as the dominant but limited evaluation protocol. We introduce a general framework for estimating the Kullback-Leibler divergence (KL) divergence between probability measures on function space, yielding a tractable, data-driven estimator that is scalable to realistic snapshot datasets. We validate the accuracy of our estimator on a benchmark suite, where the estimated functional KL closely matches the analytic KL. Applying this framework to synthetic and real scRNA-seq datasets, we show that current evaluation metrics often give inconsistent assessments, whereas path-space KL enables a coherent comparison of trajectory inference methods and exposes discrepancies in inferred dynamics, especially in regions with sparse or missing data. These results support functional KL as a principled criterion for evaluating trajectory inference under partial observability.

LGMar 11, 2025
Learning to Match Unpaired Data with Minimum Entropy Coupling

Mustapha Bounoua, Giulio Franzese, Pietro Michiardi

Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which is a significant challenge to learn a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint Entropy, while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application for cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint Entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.

LGFeb 8, 2024
S$Ω$I: Score-based O-INFORMATION Estimation

Mustapha Bounoua, Giulio Franzese, Pietro Michiardi

The analysis of scientific data and complex multivariate systems requires information quantities that capture relationships among multiple random variables. Recently, new information-theoretic measures have been developed to overcome the shortcomings of classical ones, such as mutual information, that are restricted to considering pairwise interactions. Among them, the concept of information synergy and redundancy is crucial for understanding the high-order dependencies between variables. One of the most prominent and versatile measures based on this concept is O-information, which provides a clear and scalable way to quantify the synergy-redundancy balance in multivariate systems. However, its practical application is limited to simplified cases. In this work, we introduce S$Ω$I, which allows for the first time to compute O-information without restrictive assumptions about the system. Our experiments validate our approach on synthetic data, and demonstrate the effectiveness of S$Ω$I in the context of a real-world use case.

CVMar 18, 2025
RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment

Chao Wang, Giulio Franzese, Alessandro Finamore et al.

Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt, i.e., images show wrong attribute binding, subject positioning, numeracy, etc. While the literature offers many methods to improve T2I alignment, they all consider only Diffusion Models, and require auxiliary datasets, scoring models, and linguistic analysis of the prompt. In this paper we aim to address these gaps. First, we introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation. Then, we investigate a self-supervised fine-tuning approach for T2I alignment based on RFMI that does not require auxiliary information other than the pre-trained model itself. Specifically, a fine-tuning set is constructed by selecting synthetic images generated from the pre-trained RF model and having high point-wise MI between images and prompts. Our experiments on MI estimation benchmarks demonstrate the validity of RFMI, and empirical fine-tuning on SD3.5-Medium confirms the effectiveness of RFMI for improving T2I alignment while maintaining image quality.

CLFeb 5, 2025
In Praise of Stubbornness: An Empirical Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs

Simone Clemente, Zied Ben Houidi, Alexis Huet et al.

Through systematic empirical investigation, we uncover a fundamental and concerning property of Large Language Models: while they can safely learn facts that don't contradict their knowledge, attempting to update facts with contradictory information triggers catastrophic corruption of unrelated knowledge. Unlike humans, who naturally resist contradictory information, these models indiscriminately accept contradictions, leading to devastating interference, destroying up to 80% of unrelated knowledge even when learning as few as 10-100 contradicting facts. To understand whether this interference could be mitigated through selective plasticity, we experiment with targeted network updates, distinguishing between previously used (stubborn) and rarely used (plastic) neurons. We uncover another asymmetry: while sparing frequently-used neurons significantly improves retention of existing knowledge for non-contradictory updates (98% vs 93% with standard updates), contradictory updates trigger catastrophic interference regardless of targeting strategy. This effect which persists across tested model scales (GPT-2 to GPT-J-6B), suggests a fundamental limitation in how neural networks handle contradictions. Finally, we demonstrate that contradictory information can be reliably detected (95%+ accuracy) using simple model features, offering a potential protective mechanism. These findings motivate new architectures that can, like humans, naturally resist contradictions rather than allowing destructive overwrites.

LGOct 15, 2025
TENDE: Transfer Entropy Neural Diffusion Estimation

Simon Pedro Galeano Munoz, Mustapha Bounoua, Giulio Franzese et al.

Transfer entropy measures directed information flow in time series, and it has become a fundamental quantity in applications spanning neuroscience, finance, and complex systems analysis. However, existing estimation methods suffer from the curse of dimensionality, require restrictive distributional assumptions, or need exponentially large datasets for reliable convergence. We address these limitations in the literature by proposing TENDE (Transfer Entropy Neural Diffusion Estimation), a novel approach that leverages score-based diffusion models to estimate transfer entropy through conditional mutual information. By learning score functions of the relevant conditional distributions, TENDE provides flexible, scalable estimation while making minimal assumptions about the underlying data-generating process. We demonstrate superior accuracy and robustness compared to existing neural estimators and other state-of-the-art approaches across synthetic benchmarks and real data.

LGFeb 26, 2025
INFO-SEDD: Continuous Time Markov Chains as Scalable Information Metrics Estimators

Alberto Foresti, Giulio Franzese, Pietro Michiardi

Information-theoretic quantities play a crucial role in understanding non-linear relationships between random variables and are widely used across scientific disciplines. However, estimating these quantities remains an open problem, particularly in the case of high-dimensional discrete distributions. Current approaches typically rely on embedding discrete data into a continuous space and applying neural estimators originally designed for continuous distributions, a process that may not fully capture the discrete nature of the underlying data. We consider Continuous-Time Markov Chains (CTMCs), stochastic processes on discrete state-spaces which have gained popularity due to their generative modeling applications. In this work, we introduce INFO-SEDD, a novel method for estimating information-theoretic quantities of discrete data, including mutual information and entropy. Our approach requires the training of a single parametric model, offering significant computational and memory advantages. Additionally, it seamlessly integrates with pretrained networks, allowing for efficient reuse of pretrained generative models. To evaluate our approach, we construct a challenging synthetic benchmark. Our experiments demonstrate that INFO-SEDD is robust and outperforms neural competitors that rely on embedding techniques. Moreover, we validate our method on a real-world task: estimating the entropy of an Ising model. Overall, INFO-SEDD outperforms competing methods and shows scalability to high-dimensional scenarios, paving the way for new applications where estimating MI between discrete distribution is the focus. The promising results in this complex, high-dimensional scenario highlight INFO-SEDD as a powerful new estimator in the toolkit for information-theoretical analysis.

LGJan 19, 2024
Data Augmentation for Traffic Classification

Chao Wang, Alessandro Finamore, Pietro Michiardi et al.

Data Augmentation (DA) -- enriching training data by adding synthetic samples -- is a technique widely adopted in Computer Vision (CV) and Natural Language Processing (NLP) tasks to improve models performance. Yet, DA has struggled to gain traction in networking contexts, particularly in Traffic Classification (TC) tasks. In this work, we fulfill this gap by benchmarking 18 augmentation functions applied to 3 TC datasets using packet time series as input representation and considering a variety of training conditions. Our results show that (i) DA can reap benefits previously unexplored, (ii) augmentations acting on time series sequence order and masking are better suited for TC than amplitude augmentations and (iii) basic models latent space analysis can help understanding the positive/negative effects of augmentations on classification performance.

LGMay 30, 2023
One-Line-of-Code Data Mollification Improves Optimization of Likelihood-based Generative Models

Ba-Hien Tran, Giulio Franzese, Pietro Michiardi et al.

Generative Models (GMs) have attracted considerable attention due to their tremendous success in various domains, such as computer vision where they are capable to generate impressive realistic-looking images. Likelihood-based GMs are attractive due to the possibility to generate new data by a single model evaluation. However, they typically achieve lower sample quality compared to state-of-the-art score-based diffusion models (DMs). This paper provides a significant step in the direction of addressing this limitation. The idea is to borrow one of the strengths of score-based DMs, which is the ability to perform accurate density estimation in low-density regions and to address manifold overfitting by means of data mollification. We connect data mollification through the addition of Gaussian noise to Gaussian homotopy, which is a well-known technique to improve optimization. Data mollification can be implemented by adding one line of code in the optimization loop, and we demonstrate that this provides a boost in generation quality of likelihood-based GMs, without computational overheads. We report results on image data sets with popular likelihood-based GMs, including variants of variational autoencoders and normalizing flows, showing large improvements in FID score.

LGSep 21, 2021
Improved optimization strategies for deep Multi-Task Networks

Lucas Pascal, Pietro Michiardi, Xavier Bost et al.

In Multi-Task Learning (MTL), it is a common practice to train multi-task networks by optimizing an objective function, which is a weighted average of the task-specific objective functions. Although the computational advantages of this strategy are clear, the complexity of the resulting loss landscape has not been studied in the literature. Arguably, its optimization may be more difficult than a separate optimization of the constituting task-specific objectives. In this work, we investigate the benefits of such an alternative, by alternating independent gradient descent steps on the different task-specific objective functions and we formulate a novel way to combine this approach with state-of-the-art optimizers. As the separation of task-specific objectives comes at the cost of increased computational time, we propose a random task grouping as a trade-off between better optimization and computational efficiency. Experimental results over three well-known visual MTL datasets show better overall absolute performance on losses and standard metrics compared to an averaged objective function and other state-of-the-art MTL methods. In particular, our method shows the most benefits when dealing with tasks of different nature and it enables a wider exploration of the shared parameter space. We also show that our random grouping strategy allows to trade-off between these benefits and computational efficiency.

LGJun 30, 2021
Revisiting the Effects of Stochasticity for Hamiltonian Samplers

Giulio Franzese, Dimitrios Milios, Maurizio Filippone et al.

We revisit the theoretical properties of Hamiltonian stochastic differential equations (SDES) for Bayesian posterior sampling, and we study the two types of errors that arise from numerical SDE simulation: the discretization error and the error due to noisy gradient estimates in the context of data subsampling. Our main result is a novel analysis for the effect of mini-batches through the lens of differential operator splitting, revising previous literature results. The stochastic component of a Hamiltonian SDE is decoupled from the gradient noise, for which we make no normality assumptions. This leads to the identification of a convergence bottleneck: when considering mini-batches, the best achievable error rate is $\mathcal{O}(η^2)$, with $η$ being the integrator step size. Our theoretical results are supported by an empirical study on a variety of regression and classification tasks for Bayesian neural networks.

MLJun 11, 2021
Model Selection for Bayesian Autoencoders

Ba-Hien Tran, Simone Rossi, Dimitrios Milios et al.

We develop a novel method for carrying out model selection for Bayesian autoencoders (BAEs) by means of prior hyper-parameter optimization. Inspired by the common practice of type-II maximum likelihood optimization and its equivalence to Kullback-Leibler divergence minimization, we propose to optimize the distributional sliced-Wasserstein distance (DSWD) between the output of the autoencoder and the empirical data distribution. The advantages of this formulation are that we can estimate the DSWD based on samples and handle high-dimensional problems. We carry out posterior estimation of the BAE parameters via stochastic gradient Hamiltonian Monte Carlo and turn our BAE into a generative model by fitting a flexible Dirichlet mixture model in the latent space. Consequently, we obtain a powerful alternative to variational autoencoders, which are the preferred choice in modern applications of autoencoders for representation learning with uncertainty. We evaluate our approach qualitatively and quantitatively using a vast experimental campaign on a number of unsupervised learning tasks and show that, in small-data regimes where priors matter, our approach provides state-of-the-art results, outperforming multiple competitive baselines.

MLNov 10, 2020
Sparse within Sparse Gaussian Processes using Neighbor Information

Gia-Lac Tran, Dimitrios Milios, Pietro Michiardi et al.

Approximations to Gaussian processes based on inducing variables, combined with variational inference techniques, enable state-of-the-art sparse approaches to infer GPs at scale through mini batch-based learning. In this work, we address one limitation of sparse GPs, which is due to the challenge in dealing with a large number of inducing variables without imposing a special structure on the inducing inputs. In particular, we introduce a novel hierarchical prior, which imposes sparsity on the set of inducing variables. We treat our model variationally, and we experimentally show considerable computational gains compared to standard sparse GPs when sparsity on the inducing variables is realized considering the nearest inducing inputs of a random mini-batch of the data. We perform an extensive experimental validation that demonstrates the effectiveness of our approach compared to the state-of-the-art. Our approach enables the possibility to use sparse GPs using a large number of inducing points without incurring a prohibitive computational cost.

LGOct 19, 2020
An Identifiable Double VAE For Disentangled Representations

Graziano Mita, Maurizio Filippone, Pietro Michiardi

A large part of the literature on learning disentangled representations focuses on variational autoencoders (VAE). Recent developments demonstrate that disentanglement cannot be obtained in a fully unsupervised setting without inductive biases on models and data. However, Khemakhem et al., AISTATS, 2020 suggest that employing a particular form of factorized prior, conditionally dependent on auxiliary variables complementing input observations, can be one such bias, resulting in an identifiable model with guarantees on disentanglement. Working along this line, we propose a novel VAE-based generative model with theoretical guarantees on identifiability. We obtain our conditional prior over the latents by learning an optimal representation, which imposes an additional strength on their regularization. We also extend our method to semi-supervised settings. Experimental results indicate superior performance with respect to state-of-the-art approaches, according to several established metrics proposed in the literature on disentanglement.

CVJun 17, 2020
Maximum Roaming Multi-Task Learning

Lucas Pascal, Pietro Michiardi, Xavier Bost et al.

Multi-task learning has gained popularity due to the advantages it provides with respect to resource usage and performance. Nonetheless, the joint optimization of parameters with respect to multiple tasks remains an active research topic. Sub-partitioning the parameters between different tasks has proven to be an efficient way to relax the optimization constraints over the shared weights, may the partitions be disjoint or overlapping. However, one drawback of this approach is that it can weaken the inductive bias generally set up by the joint task optimization. In this work, we present a novel way to partition the parameter space without weakening the inductive bias. Specifically, we propose Maximum Roaming, a method inspired by dropout that randomly varies the parameter partitioning, while forcing them to visit as many tasks as possible at a regulated frequency, so that the network fully adapts to each update. We study the properties of our method through experiments on a variety of visual multi-task data sets. Experimental results suggest that the regularization brought by roaming has more impact on performance than usual partitioning optimization strategies. The overall method is flexible, easily applicable, provides superior regularization and consistently achieves improved performances compared to recent multi-task learning formulations.

LGJun 9, 2020
Isotropic SGD: a Practical Approach to Bayesian Posterior Sampling

Giulio Franzese, Rosa Candela, Dimitrios Milios et al.

In this work we define a unified mathematical framework to deepen our understanding of the role of stochastic gradient (SG) noise on the behavior of Markov chain Monte Carlo sampling (SGMCMC) algorithms. Our formulation unlocks the design of a novel, practical approach to posterior sampling, which makes the SG noise isotropic using a fixed learning rate that we determine analytically, and that requires weaker assumptions than existing algorithms. In contrast, the common traits of existing \sgmcmc algorithms is to approximate the isotropy condition either by drowning the gradients in additive noise (annealing the learning rate) or by making restrictive assumptions on the \sg noise covariance and the geometry of the loss landscape. Extensive experimental validations indicate that our proposal is competitive with the state-of-the-art on \sgmcmc, while being much more practical to use.

LGJun 8, 2020
A Variational View on Bootstrap Ensembles as Bayesian Inference

Dimitrios Milios, Pietro Michiardi, Maurizio Filippone

In this paper, we employ variational arguments to establish a connection between ensemble methods for Neural Networks and Bayesian inference. We consider an ensemble-based scheme where each model/particle corresponds to a perturbation of the data by means of parametric bootstrap and a perturbation of the prior. We derive conditions under which any optimization steps of the particles makes the associated distribution reduce its divergence to the posterior over model parameters. Such conditions do not require any particular form for the approximation and they are purely geometrical, giving insights on the behavior of the ensemble on a number of interesting models such as Neural Networks with ReLU activations. Experiments confirm that ensemble methods can be a valid alternative to approximate Bayesian inference; the theoretical developments in the paper seek to explain this behavior.

LGNov 15, 2019
LIBRE: Learning Interpretable Boolean Rule Ensembles

Graziano Mita, Paolo Papotti, Maurizio Filippone et al.

We present a novel method - LIBRE - to learn an interpretable classifier, which materializes as a set of Boolean rules. LIBRE uses an ensemble of bottom-up weak learners operating on a random subset of features, which allows for the learning of rules that generalize well on unseen data even in imbalanced settings. Weak learners are combined with a simple union so that the final ensemble is also interpretable. Experimental results indicate that LIBRE efficiently strikes the right balance between prediction accuracy, which is competitive with black box methods, and interpretability, which is often superior to alternative methods from the literature.

LGOct 21, 2019
Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD

Rosa Candela, Giulio Franzese, Maurizio Filippone et al.

Large scale machine learning is increasingly relying on distributed optimization, whereby several machines contribute to the training process of a statistical model. In this work we study the performance of asynchronous, distributed settings, when applying sparsification, a technique used to reduce communication overheads. In particular, for the first time in an asynchronous, non-convex setting, we theoretically prove that, in presence of staleness, sparsification does not harm SGD performance: the ergodic convergence rate matches the known result of standard SGD, that is $\mathcal{O} \left( 1/\sqrt{T} \right)$. We also carry out an empirical study to complement our theory, and confirm that the effects of sparsification on the convergence rate are negligible, when compared to 'vanilla' SGD, even in the challenging scenario of an asynchronous, distributed system.

LGFeb 28, 2019
A comparative evaluation of novelty detection algorithms for discrete sequences

Rémi Domingues, Pietro Michiardi, Jérémie Barlet et al.

The identification of anomalies in temporal data is a core component of numerous research areas such as intrusion detection, fault prevention, genomics and fraud detection. This article provides an experimental comparison of the novelty detection problem applied to discrete sequences. The objective of this study is to identify which state-of-the-art methods are efficient and appropriate candidates for a given use case. These recommendations rely on extensive novelty detection experiments based on a variety of public datasets in addition to novel industrial datasets. We also perform thorough scalability and memory usage tests resulting in new supplementary insights of the methods' performance, key selection criterion to solve problems relying on large volumes of data and to meet the expectations of applications subject to strict response time constraints.

MLOct 18, 2018
Good Initializations of Variational Bayes for Deep Models

Simone Rossi, Pietro Michiardi, Maurizio Filippone

Stochastic variational inference is an established way to carry out approximate Bayesian inference for deep models. While there have been effective proposals for good initializations for loss minimization in deep learning, far less attention has been devoted to the issue of initialization of stochastic variational inference. We address this by proposing a novel layer-wise initialization strategy based on Bayesian linear models. The proposed method is extensively validated on regression and classification tasks, including Bayesian DeepNets and ConvNets, showing faster and better convergence compared to alternatives inspired by the literature on initializations for loss minimization.

LGMay 28, 2018
Dirichlet-based Gaussian Processes for Large-scale Calibrated Classification

Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi et al.

In this paper, we study the problem of deriving fast and accurate classification algorithms with uncertainty quantification. Gaussian process classification provides a principled approach, but the corresponding computational burden is hardly sustainable in large-scale problems and devising efficient alternatives is a challenge. In this work, we investigate if and how Gaussian process regression directly applied to the classification labels can be used to tackle this question. While in this case training time is remarkably faster, predictions need be calibrated for classification and uncertainty estimation. To this aim, we propose a novel approach based on interpreting the labels as the output of a Dirichlet distribution. Extensive experimental results show that the proposed approach provides essentially the same accuracy and uncertainty quantification of Gaussian process classification while requiring only a fraction of computational resources.

MLMay 26, 2018
Calibrating Deep Convolutional Gaussian Processes

Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham et al.

The wide adoption of Convolutional Neural Networks (CNNs) in applications where decision-making under uncertainty is fundamental, has brought a great deal of attention to the ability of these models to accurately quantify the uncertainty in their predictions. Previous work on combining CNNs with Gaussian processes (GPs) has been developed under the assumption that the predictive probabilities of these models are well-calibrated. In this paper we show that, in fact, current combinations of CNNs and GPs are miscalibrated. We proposes a novel combination that considerably outperforms previous approaches on this aspect, while achieving state-of-the-art performance on image classification tasks.

MLOct 14, 2016
Random Feature Expansions for Deep Gaussian Processes

Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi et al.

The composition of multiple Gaussian Processes as a Deep Gaussian Process (DGP) enables a deep probabilistic nonparametric approach to flexibly tackle complex machine learning problems with sound quantification of uncertainty. Existing inference approaches for DGP models have limited scalability and are notoriously cumbersome to construct. In this work, we introduce a novel formulation of DGPs based on random feature expansions that we train using stochastic variational inference. This yields a practical learning framework which significantly advances the state-of-the-art in inference for DGPs, and enables accurate quantification of uncertainty. We extensively showcase the scalability and performance of our proposal on several datasets with up to 8 million observations, and various DGP architectures with up to 30 hidden layers.