ML Nov 3, 2023
Learning Sparse Codes with Entropy-Based ELBOs
Dmytro Velychko, Simon Damm, Asja Fischer et al.
Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.
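The generative model named in the abstract above (Laplace prior, linear mapping, Gaussian observables) can be sketched as a sampler. This is a minimal illustration only: the function name, the unit Laplace scale, the isotropic noise level, and the dictionary sizes are assumptions, not taken from the paper.

```python
import numpy as np

def sample_sparse_coding(n_samples, W, noise_std, rng):
    """Draw (z, x) pairs from a standard sparse coding model.

    z ~ Laplace(0, 1) independently per latent   (assumed unit scale)
    x ~ Normal(W z, noise_std^2 * I)             (assumed isotropic noise)
    """
    n_observed, n_latents = W.shape
    z = rng.laplace(loc=0.0, scale=1.0, size=(n_samples, n_latents))
    noise = rng.normal(0.0, noise_std, size=(n_samples, n_observed))
    x = z @ W.T + noise
    return z, x

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))  # hypothetical dictionary: 16 observables, 32 latents
z, x = sample_sparse_coding(500, W, noise_std=0.1, rng=rng)
```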
ML Sep 7, 2022
On the Convergence of the ELBO to Entropy Sums
Jörg Lücke, Jan Warnken
The variational lower bound (a.k.a. ELBO or free energy) is the central objective for many established as well as for many novel algorithms for unsupervised learning. Such algorithms usually increase the bound until parameters have converged to values close to a stationary point of the learning dynamics. Here we show that (for a very large class of generative models) the variational lower bound is at all stationary points of learning equal to a sum of entropies. Concretely, for standard generative models with one set of latents and one set of observed variables, the sum consists of three entropies: (A) the (average) entropy of the variational distributions, (B) the negative entropy of the model's prior distribution, and (C) the (expected) negative entropy of the observable distribution. The obtained result applies under realistic conditions including: finite numbers of data points, at any stationary point (including saddle points) and for any family of (well behaved) variational distributions. The class of generative models for which we show the equality to entropy sums contains many standard as well as novel generative models including standard (Gaussian) variational autoencoders. The prerequisites we use to show equality to entropy sums are relatively mild. Concretely, the distributions defining a given generative model have to be of the exponential family, and the model has to satisfy a parameterization criterion (which is usually fulfilled). Proving equality of the ELBO to entropy sums at stationary points (under the stated conditions) is the main contribution of this work.
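Written out, the three-entropy sum described in the abstract above takes the following form. The notation is an assumption made here for illustration: $q^{(n)}_\Phi$ is the variational distribution for data point $n$, $p_\Theta(z)$ the prior, $p_\Theta(x \mid z)$ the observable distribution, and $\mathcal{H}[\cdot]$ the (differential) entropy.

```latex
% ELBO at a stationary point, as a sum of three entropies (notation assumed):
\mathcal{F}(\Phi,\Theta)
  \;=\; \frac{1}{N}\sum_{n=1}^{N} \mathcal{H}\!\left[q^{(n)}_{\Phi}(z)\right]
  \;-\; \mathcal{H}\!\left[p_{\Theta}(z)\right]
  \;-\; \frac{1}{N}\sum_{n=1}^{N}
        \mathbb{E}_{q^{(n)}_{\Phi}}\!\left\{ \mathcal{H}\!\left[p_{\Theta}(x \mid z)\right] \right\}
```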
ML Oct 17, 2025
Disentanglement of Sources in a Multi-Stream Variational Autoencoder
Veranika Boukun, Jörg Lücke
Variational autoencoders (VAEs) are a leading approach to address the problem of learning disentangled representations. Typically a single VAE is used and disentangled representations are sought in its continuous latent space. Here we explore a different approach by using discrete latents to combine VAE-representations of individual sources. The combination is done based on an explicit model for source combination, and we here use a linear combination model which is well suited, e.g., for acoustic data. We formally define such a multi-stream VAE (MS-VAE) approach, derive its inference and learning equations, and we numerically investigate its principled functionality. The MS-VAE is domain-agnostic, and we here explore its ability to separate sources into different streams using superimposed hand-written digits, and mixed acoustic sources in a speaker diarization task. We observe a clear separation of digits, and on speaker diarization we observe an especially low rate of missed speakers. Numerical experiments further highlight the flexibility of the approach across varying amounts of supervision and training data.
ML Jan 21, 2025
Sublinear Variational Optimization of Gaussian Mixture Models with Millions to Billions of Parameters
Sebastian Salwig, Till Kahlke, Florian Hirschberger et al.
Gaussian Mixture Models (GMMs) range among the most frequently used machine learning models. However, training large, general GMMs becomes computationally prohibitive for datasets with many data points $N$ of high-dimensionality $D$. For GMMs with arbitrary covariances, we here derive a highly efficient variational approximation, which is integrated with mixtures of factor analyzers (MFAs). For GMMs with $C$ components, our proposed algorithm significantly reduces runtime complexity per iteration from $\mathcal{O}(NCD^2)$ to a complexity scaling linearly with $D$ and remaining constant w.r.t. $C$. Numerical validation of this theoretical complexity reduction then shows the following: the distance evaluations required for the entire GMM optimization process scale sublinearly with $NC$. On large-scale benchmarks, this sublinearity results in speed-ups of an order-of-magnitude compared to the state-of-the-art. As a proof of concept, we train GMMs with over 10 billion parameters on about 100 million images, and observe training times of approximately nine hours on a single state-of-the-art CPU.
ML Dec 25, 2024
Generative Models with ELBOs Converging to Entropy Sums
Jan Warnken, Dmytro Velychko, Simon Damm et al.
The evidence lower bound (ELBO) is one of the most central objectives for probabilistic unsupervised learning. For the ELBOs of several generative models and model classes, we here prove convergence to entropy sums. As one result, we provide a list of generative models for which entropy convergence has been shown, so far, along with the corresponding expressions for entropy sums. Our considerations include very prominent generative models such as probabilistic PCA, sigmoid belief nets or Gaussian mixture models. However, we treat more models and entire model classes such as general mixtures of exponential family distributions. Our main contributions are the proofs for the individual models. For each given model we show that the conditions stated in Theorem 1 or Theorem 2 of [arXiv:2209.03077] are fulfilled such that by virtue of the theorems the given model's ELBO is equal to an entropy sum at all stationary points. The equality of the ELBO at stationary points applies under realistic conditions: for finite numbers of data points, for model/data mismatches, at any stationary point including saddle points etc, and it applies for any well behaved family of variational distributions.
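For a one-dimensional linear Gaussian model (a toy stand-in for probabilistic PCA), the equality to an entropy sum can be checked numerically. The sketch below is not the paper's code: it assumes a standard normal prior, exact posteriors (so the ELBO equals the average log-likelihood), and maximum likelihood parameters computed in closed form; all function names are hypothetical.

```python
import numpy as np

def gaussian_entropy(var):
    # Differential entropy of a 1-D Gaussian with variance `var`.
    return 0.5 * np.log(2.0 * np.pi * np.e * var)

def avg_log_likelihood(x, mu, w, sigma2):
    # Model: z ~ N(0, 1), x | z ~ N(w z + mu, sigma2)  =>  x ~ N(mu, w^2 + sigma2).
    var = w ** 2 + sigma2
    return np.mean(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mu) ** 2 / var)

def entropy_sum(w, sigma2):
    # H[posterior] - H[prior] - H[observable distribution].
    posterior_var = sigma2 / (w ** 2 + sigma2)
    return (gaussian_entropy(posterior_var)
            - gaussian_entropy(1.0)
            - gaussian_entropy(sigma2))

rng = np.random.default_rng(0)
x = rng.normal(1.5, 2.0, size=1000)

# Stationary point: mean and total variance match the data (ML solution).
mu_ml, s_ml = x.mean(), x.var()
w, sigma2 = np.sqrt(0.5 * s_ml), 0.5 * s_ml
elbo_at_optimum = avg_log_likelihood(x, mu_ml, w, sigma2)
entropies_at_optimum = entropy_sum(w, sigma2)

# Away from a stationary point the two quantities differ.
elbo_off = avg_log_likelihood(x, mu_ml, w, 2.0 * sigma2)
entropies_off = entropy_sum(w, 2.0 * sigma2)
```

At the maximum likelihood parameters both quantities reduce to $-\tfrac{1}{2}\log(2\pi e\, S)$, with $S$ the sample variance; the second parameter setting shows that the equality is a property of stationary points, not of arbitrary parameters.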
ML Dec 22, 2020
Evolutionary Variational Optimization of Generative Models
Jakob Drefs, Enrico Guiraud, Jörg Lücke
We combine two popular optimization approaches to derive learning algorithms for generative models: variational optimization and evolutionary algorithms. The combination is realized for generative models with discrete latents by using truncated posteriors as the family of variational distributions. The variational parameters of truncated posteriors are sets of latent states. By interpreting these states as genomes of individuals and by using the variational lower bound to define a fitness, we can apply evolutionary algorithms to realize the variational loop. The used variational distributions are very flexible and we show that evolutionary algorithms can effectively and efficiently optimize the variational bound. Furthermore, the variational loop is generally applicable ("black box") with no analytical derivations required. To show general applicability, we apply the approach to three generative models (we use noisy-OR Bayes Nets, Binary Sparse Coding, and Spike-and-Slab Sparse Coding). To demonstrate effectiveness and efficiency of the novel variational approach, we use the standard competitive benchmarks of image denoising and inpainting. The benchmarks allow quantitative comparisons to a wide range of methods including probabilistic approaches, deep deterministic and generative networks, and non-local image processing methods. In the category of "zero-shot" learning (when only the corrupted image is used for training), we observed the evolutionary variational algorithm to significantly improve the state-of-the-art in many benchmark settings. For one well-known inpainting benchmark, we also observed state-of-the-art performance across all categories of algorithms although we only train on the corrupted image. In general, our investigations highlight the importance of research on optimization methods for generative models to achieve performance improvements.
ML Nov 27, 2020
Direct Evolutionary Optimization of Variational Autoencoders With Binary Latents
Enrico Guiraud, Jakob Drefs, Jörg Lücke
Discrete latent variables are considered important for real world data, which has motivated research on Variational Autoencoders (VAEs) with discrete latents. However, standard VAE training is not possible in this case, which has motivated different strategies to manipulate discrete distributions in order to train discrete VAEs similarly to conventional ones. Here we ask if it is also possible to keep the discrete nature of the latents fully intact by applying a direct discrete optimization for the encoding model. The approach consequently diverges strongly from standard VAE training by sidestepping sampling approximations, the reparameterization trick, and amortization. Discrete optimization is realized in a variational setting using truncated posteriors in conjunction with evolutionary algorithms. For VAEs with binary latents, we show (A) how such a discrete variational method ties into gradient ascent for network weights, and (B) how the decoder is used to select latent states for training. Conventional amortized training is more efficient and applicable to large neural networks. However, using smaller networks, we here find direct discrete optimization to be efficiently scalable to hundreds of latents. More importantly, we find the effectiveness of direct optimization to be highly competitive in "zero-shot" learning. In contrast to large supervised networks, the VAEs investigated here can, e.g., denoise a single image without previous training on clean data and/or training on large image datasets. More generally, the studied approach shows that training of VAEs is indeed possible without sampling-based approximation and reparameterization, which may be interesting for the analysis of VAE training in general. For "zero-shot" settings, direct optimization furthermore makes VAEs competitive where they have previously been outperformed by non-generative approaches.
ML Oct 28, 2020
The ELBO of Variational Autoencoders Converges to a Sum of Three Entropies
Simon Damm, Dennis Forster, Dmytro Velychko et al.
The central objective function of a variational autoencoder (VAE) is its variational lower bound (the ELBO). Here we show that for standard (i.e., Gaussian) VAEs the ELBO converges to a value given by the sum of three entropies: the (negative) entropy of the prior distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions (the latter is already part of the ELBO). Our derived analytical results are exact and apply for small as well as for intricate deep networks for encoder and decoder. Furthermore, they apply for finitely and infinitely many data points and at any stationary point (including local maxima and saddle points). The result implies that the ELBO can for standard VAEs often be computed in closed-form at stationary points while the original ELBO requires numerical approximations of integrals. As a main contribution, we provide the proof that the ELBO for VAEs is at stationary points equal to entropy sums. Numerical experiments then show that the obtained analytical results are sufficiently precise also in those vicinities of stationary points that are reached in practice. Furthermore, we discuss how the novel entropy form of the ELBO can be used to analyze and understand learning behavior. More generally, we believe that our contributions can be useful for future theoretical and practical studies on VAE learning as they provide novel information on those points in parameters space that optimization of VAEs converges to.
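Because all three distributions are Gaussian, each entropy in the sum has a closed form. The expression below is a sketch under assumptions not spelled out in the abstract: a standard normal prior over $L$ latents, a decoder with isotropic variance $\sigma^2$ over $D$ observables, and an encoder with diagonal covariance whose entries are $\tau_l^2(x^{(n)})$. The symbols are chosen here for illustration.

```latex
% Entropy-sum form of the ELBO for a Gaussian VAE at stationary points
% (assumed: prior N(0, I_L), decoder variance sigma^2 I_D, diagonal encoder):
\mathcal{F}
  \;=\; \frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{L}
        \tfrac{1}{2}\log\!\left(2\pi e\,\tau_l^{2}\!\left(x^{(n)}\right)\right)
  \;-\; \frac{L}{2}\log\!\left(2\pi e\right)
  \;-\; \frac{D}{2}\log\!\left(2\pi e\,\sigma^{2}\right)
```

Under these assumptions no integral over the latents remains, which is consistent with the abstract's statement that the ELBO can often be computed in closed form at stationary points.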
LG Mar 4, 2020
Generic Unsupervised Optimization for a Latent Variable Model With Exponential Family Observables
Hamid Mousavi, Jakob Drefs, Florian Hirschberger et al.
Latent variable models (LVMs) represent observed variables by parameterized functions of latent variables. Prominent examples of LVMs for unsupervised learning are probabilistic PCA or probabilistic SC which both assume a weighted linear summation of the latents to determine the mean of a Gaussian distribution for the observables. In many cases, however, observables do not follow a Gaussian distribution. For unsupervised learning, LVMs which assume specific non-Gaussian observables have therefore been considered. Already for specific choices of distributions, parameter optimization is challenging and only a few previous contributions considered LVMs with more generally defined observable distributions. Here, we consider LVMs that are defined for a range of different distributions, i.e., observables can follow any (regular) distribution of the exponential family. The novel class of LVMs presented is defined for binary latents, and it uses maximization in place of summation to link the latents to observables. To derive an optimization procedure, we follow an EM approach for maximum likelihood parameter estimation. We show that a set of very concise parameter update equations can be derived which feature the same functional form for all exponential family distributions. The derived generic optimization can consequently be applied to different types of metric data as well as to different types of discrete data. Also, the derived optimization equations can be combined with a recently suggested variational acceleration which is likewise generically applicable to the LVMs considered here. So, the combination maintains generic and direct applicability of the derived optimization procedure, but, crucially, enables efficient scalability. We numerically verify our analytical results and discuss some potential applications such as learning of variance structure, noise type estimation and denoising.
SP Aug 1, 2019
ProSper -- A Python Library for Probabilistic Sparse Coding with Non-Standard Priors and Superpositions
Georgios Exarchakis, Jörg Bornschein, Abdul-Saboor Sheikh et al.
ProSper is a python library containing probabilistic algorithms to learn dictionaries. Given a set of data points, the implemented algorithms seek to learn the elementary components that have generated the data. The library widens the scope of dictionary learning approaches beyond implementations of standard approaches such as ICA, NMF or standard L1 sparse coding. The implemented algorithms are especially well-suited in cases when data consist of components that combine non-linearly and/or for data requiring flexible prior distributions. Furthermore, the implemented algorithms go beyond standard approaches by inferring prior and noise parameters of the data, and they provide rich a-posteriori approximations for inference. The library is designed to be extendable and it currently includes: Binary Sparse Coding (BSC), Ternary Sparse Coding (TSC), Discrete Sparse Coding (DSC), Maximal Causes Analysis (MCA), Maximum Magnitude Causes Analysis (MMCA), and Gaussian Sparse Coding (GSC, a recent spike-and-slab sparse coding approach). The algorithms are scalable due to a combination of variational approximations and parallelization. Implementations of all algorithms allow for parallel execution on multiple CPUs and multiple machines for medium to large-scale applications. Typical large-scale runs of the algorithms can use hundreds of CPUs to learn hundreds of dictionary elements from data with tens of millions of floating-point numbers such that models with several hundred thousand parameters can be optimized. The library is designed to have minimal dependencies and to be easy to use. It targets users of dictionary learning algorithms and Machine Learning researchers.
ML Oct 1, 2018
Large Scale Clustering with Variational EM for Gaussian Mixture Models
Florian Hirschberger, Dennis Forster, Jörg Lücke
This paper represents a preliminary (pre-reviewing) version of a sublinear variational algorithm for isotropic Gaussian mixture models (GMMs). Further developments of the algorithm for GMMs with diagonal covariance matrices (instead of isotropic clusters) and their corresponding benchmarking results have been published by TPAMI (doi:10.1109/TPAMI.2021.3133763) in the paper "A Variational EM Acceleration for Efficient Clustering at Very Large Scales". We kindly refer the reader to the TPAMI paper instead of this much earlier arXiv version (the TPAMI paper is also open access). Publicly available source code accompanies the paper (see https://github.com/variational-sublinear-clustering). Please note that the TPAMI paper does not contain the benchmark on the 80 Million Tiny Images dataset anymore because we followed the call of the dataset creators to discontinue the use of that dataset. The aim of the project (which resulted in this arXiv version and the later TPAMI paper) is the exploration of the current efficiency and large-scale limits in fitting a parametric model for clustering to data distributions. To reduce computational complexity, we used a clustering objective based on truncated variational EM (which reduces complexity for many clusters) in combination with coreset objectives (which reduce complexity for many data points). We used efficient coreset construction and efficient seeding to translate the theoretical sublinear complexity gains into an efficient algorithm. In applications to standard large-scale benchmarks for clustering, we then observed substantial wall-clock speedups compared to already highly efficient clustering approaches. To demonstrate that the observed efficiency enables applications previously considered unfeasible, we clustered the entire and unscaled 80 Million Tiny Images dataset into up to 32,000 clusters.
ML Dec 21, 2017
Truncated Variational Sampling for "Black Box" Optimization of Generative Models
Jörg Lücke, Zhenwen Dai, Georgios Exarchakis
We investigate the optimization of two probabilistic generative models with binary latent variables using a novel variational EM approach. The approach distinguishes itself from previous variational approaches by using latent states as variational parameters. Here we use efficient and general purpose sampling procedures to vary the latent states, and investigate the "black box" applicability of the resulting optimization procedure. For general purpose applicability, samples are drawn from approximate marginal distributions of the considered generative model as well as from the model's prior distribution. As such, variational sampling is defined in a generic form, and is directly executable for a given model. As a proof of concept, we then apply the novel procedure (A) to Binary Sparse Coding (a model with continuous observables), and (B) to basic Sigmoid Belief Networks (which are models with binary observables). Numerical experiments verify that the investigated approach efficiently as well as effectively increases a variational free energy objective without requiring any additional analytical steps.
ML Nov 9, 2017
Can clustering scale sublinearly with its clusters? A variational EM acceleration of GMMs and $k$-means
Dennis Forster, Jörg Lücke
One iteration of standard $k$-means (i.e., Lloyd's algorithm) or standard EM for Gaussian mixture models (GMMs) scales linearly with the number of clusters $C$, data points $N$, and data dimensionality $D$. In this study, we explore whether one iteration of $k$-means or EM for GMMs can scale sublinearly with $C$ at run-time while remaining effective at improving the clustering objective. The tool we apply for complexity reduction is variational EM, which is typically used to make training of generative models with exponentially many hidden states tractable. Here, we apply novel theoretical results on truncated variational EM to make tractable clustering algorithms more efficient. The basic idea is to use a partial variational E-step which reduces the linear complexity of $\mathcal{O}(NCD)$ required for a full E-step to a sublinear complexity. Our main observation is that the linear dependency on $C$ can be reduced to a dependency on a much smaller parameter $G$ which relates to cluster neighborhood relations. We focus on two versions of partial variational EM for clustering: variational GMM, scaling with $\mathcal{O}(NG^2D)$, and variational $k$-means, scaling with $\mathcal{O}(NGD)$ per iteration. Empirical results show that these algorithms still require comparable numbers of iterations to improve the clustering objective to the same values as $k$-means. For data with many clusters, we consequently observe reductions of net computational demands between two and three orders of magnitude. More generally, our results provide substantial empirical evidence in favor of clustering to scale sublinearly with $C$.
ML Apr 16, 2017
$k$-means as a variational EM approximation of Gaussian mixture models
Jörg Lücke, Dennis Forster
We show that $k$-means (Lloyd's algorithm) is obtained as a special case when truncated variational EM approximations are applied to Gaussian Mixture Models (GMM) with isotropic Gaussians. In contrast to the standard way to relate $k$-means and GMMs, the provided derivation shows that it is not required to consider Gaussians with small variances or the limit case of zero variances. There are a number of consequences that directly follow from our approach: (A) $k$-means can be shown to increase a free energy associated with truncated distributions and this free energy can directly be reformulated in terms of the $k$-means objective; (B) $k$-means generalizations can directly be derived by considering the 2nd closest, 3rd closest etc. cluster in addition to just the closest one; and (C) the embedding of $k$-means into a free energy framework allows for theoretical interpretations of other $k$-means generalizations in the literature. In general, truncated variational EM provides a natural and rigorous quantitative link between $k$-means-like clustering and GMM clustering algorithms which may be very relevant for future theoretical and empirical studies.
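The special case described in the abstract above can be illustrated numerically. The sketch below assumes an isotropic GMM with equal mixing proportions and a shared variance `sigma2`, and truncates each posterior to the `n_keep` closest clusters; these choices and all function names are assumptions made for illustration, not the paper's implementation. With `n_keep=1` the mean update coincides with one Lloyd iteration for any positive `sigma2`, so no small-variance limit is needed.

```python
import numpy as np

def squared_distances(X, means):
    # Pairwise squared Euclidean distances, shape (N, C).
    return ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)

def truncated_em_step(X, means, sigma2, n_keep):
    """One EM step with posteriors truncated to the n_keep closest clusters."""
    d2 = squared_distances(X, means)
    keep = np.argsort(d2, axis=1)[:, :n_keep]
    mask = np.zeros(d2.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    # Truncated posterior: proportional to the exact posterior inside the
    # kept set, zero outside of it.
    log_w = np.where(mask, -d2 / (2.0 * sigma2), -np.inf)
    log_w -= log_w.max(axis=1, keepdims=True)
    resp = np.exp(log_w)
    resp /= resp.sum(axis=1, keepdims=True)
    # Standard M-step for the means (assumes no cluster ends up empty).
    return (resp.T @ X) / resp.sum(axis=0)[:, None]

def lloyd_step(X, means):
    """One iteration of standard k-means (assumes no cluster ends up empty)."""
    labels = squared_distances(X, means).argmin(axis=1)
    return np.stack([X[labels == c].mean(axis=0) for c in range(len(means))])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
init = X[:4].copy()  # initialise at data points so every cluster is non-empty

means_truncated = truncated_em_step(X, init, sigma2=1.0, n_keep=1)
means_lloyd = lloyd_step(X, init)
means_soft = truncated_em_step(X, init, sigma2=1.0, n_keep=2)
```

Setting `n_keep=2` gives one of the generalizations mentioned under (B), where the second-closest cluster also receives posterior mass.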
ML Feb 7, 2017
Truncated Variational EM for Semi-Supervised Neural Simpletrons
Dennis Forster, Jörg Lücke
Inference and learning for probabilistic generative networks are often very challenging, which typically prevents scalability to networks as large as those used for deep discriminative approaches. To obtain efficiently trainable, large-scale and well performing generative networks for semi-supervised learning, we here combine two recent developments: a neural network reformulation of hierarchical Poisson mixtures (Neural Simpletrons), and a novel truncated variational EM approach (TV-EM). TV-EM provides theoretical guarantees for learning in generative networks, and its application to Neural Simpletrons results in particularly compact, yet approximately optimal, modifications of learning equations. If applied to standard benchmarks, we empirically find that learning converges in fewer EM iterations, that the complexity per EM iteration is reduced, and that final likelihood values are higher on average. For the task of classification on data sets with few labels, learning improvements result in consistently lower error rates if compared to applications without truncation. Experiments on the MNIST data set herein allow for comparison to standard and state-of-the-art models in the semi-supervised setting. Further experiments on the NIST SD19 data set show the scalability of the approach when a manifold of additional unlabeled data is available.
ML Oct 10, 2016
Truncated Variational Expectation Maximization
Jörg Lücke
We derive a novel variational expectation maximization approach based on truncated posterior distributions. Truncated distributions are proportional to exact posteriors within subsets of a discrete state space and equal zero otherwise. The treatment of the distributions' subsets as variational parameters distinguishes the approach from previous variational approaches. The specific structure of truncated distributions allows for deriving novel and mathematically grounded results, which in turn can be used to formulate novel efficient algorithms to optimize the parameters of probabilistic generative models. Most centrally, we find the variational lower bounds that correspond to truncated distributions to be given by very concise and efficiently computable expressions, while update equations for model parameters remain in their standard form. Based on these findings, we show how efficient and easily applicable meta-algorithms can be formulated that guarantee a monotonic increase of the variational bound. Example applications of the framework derived here provide novel theoretical results and learning procedures for latent variable models as well as mixture models. Furthermore, we show that truncated variational EM naturally interpolates between standard EM with full posteriors and EM based on the maximum a-posteriori state (MAP). The approach can, therefore, be regarded as a generalization of the popular 'hard EM' approach towards a similarly efficient method which can capture more of the true posterior structure.
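The definition of a truncated distribution and the resulting concise form of the lower bound can be checked on a toy example. The sketch assumes a single data point, a state space of eight discrete states, and randomly drawn joint log-probabilities; the names `log_joint`, `K`, `truncated_posterior` and `elbo` are hypothetical. For a truncated distribution over a subset `K`, the generic ELBO $\sum_s q(s)\,[\log p(s,x) - \log q(s)]$ reduces to $\log \sum_{s \in K} p(s,x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 8
# Hypothetical joint log-probabilities log p(s, x | Theta) for one data point x.
log_joint = rng.normal(size=n_states)
# Variational parameter: a subset of the discrete state space.
K = np.array([1, 4, 6])

def truncated_posterior(log_joint, K):
    # Proportional to the exact posterior inside K, zero outside of it.
    q = np.zeros_like(log_joint)
    w = np.exp(log_joint[K] - log_joint[K].max())
    q[K] = w / w.sum()
    return q

def elbo(log_joint, q):
    # Generic variational lower bound for a discrete latent state.
    support = q > 0
    return np.sum(q[support] * (log_joint[support] - np.log(q[support])))

q = truncated_posterior(log_joint, K)
free_energy = elbo(log_joint, q)
concise_form = np.log(np.sum(np.exp(log_joint[K])))

# The two ends of the interpolation mentioned in the abstract:
full_bound = elbo(log_joint, truncated_posterior(log_joint, np.arange(n_states)))
log_evidence = np.log(np.sum(np.exp(log_joint)))
map_bound = elbo(log_joint, truncated_posterior(log_joint, np.array([log_joint.argmax()])))
```

With `K` equal to the full state space the bound equals the log-evidence (standard EM), and with `K` containing only the most probable state it equals the joint log-probability of the MAP state.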
ML Jun 28, 2015
Neural Simpletrons - Minimalistic Directed Generative Networks for Learning with Few Labels
Dennis Forster, Abdul-Saboor Sheikh, Jörg Lücke
Classifiers for the semi-supervised setting often combine strong supervised models with additional learning objectives to make use of unlabeled data. This results in powerful though very complex models that are hard to train and that demand additional labels for optimal parameter tuning, which are often not given when labeled data is very sparse. We here study a minimalistic multi-layer generative neural network for semi-supervised learning in a form and setting as similar to standard discriminative networks as possible. Based on normalized Poisson mixtures, we derive compact and local learning and neural activation rules. Learning and inference in the network can be scaled using standard deep learning tools for parallelized GPU implementation. With the single objective of likelihood optimization, both labeled and unlabeled data are naturally incorporated into learning. Empirical evaluations on standard benchmarks show that for datasets with few labels the derived minimalistic network improves on all classical deep learning approaches and is competitive with their recent variants without the need for additional labels for parameter tuning. Furthermore, we find that the studied network is the best performing monolithic ('non-hybrid') system for few labels, and that it can be applied in the limit of very few labels, where no other system has been reported to operate so far.
CV Jan 12, 2012
Autonomous Cleaning of Corrupted Scanned Documents - A Generative Modeling Approach
Zhenwen Dai, Jörg Lücke
We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink etc. We aim at autonomously removing dirt from a single letter-size page based only on the information the page contains. Our approach, therefore, has to learn character representations without supervision and requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model parameterizing pattern features, feature variances, the features' planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent, independently of their absolute position, planar feature arrangements and their variances. A quality measure defined based on the learned representation then allows for an autonomous discrimination between regular character patterns and the irregular patterns making up the dirt. The irregular patterns can thus be removed to clean the document. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, even if heavily corrupted by dirt, we show that a page containing a lower number of character types can efficiently and autonomously be cleaned solely based on the structural regularity of the characters it contains. In different examples using characters from different alphabets, we demonstrate generality of the approach and discuss its implications for future developments.