Jakub M. Tomczak

LG
h-index31
54papers
5,895citations
Novelty51%
AI Score43

54 Papers

LGDec 23, 2022
A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference

Emile van Krieken, Thiviyan Thanapalasingam, Jakub M. Tomczak et al.

We study the problem of combining neural networks with symbolic reasoning. Recently introduced frameworks for Probabilistic Neurosymbolic Learning (PNL), such as DeepProbLog, perform exponential-time exact inference, limiting the scalability of PNL solutions. We introduce Approximate Neurosymbolic Inference (A-NeSI): a new framework for PNL that uses neural networks for scalable approximate inference. A-NeSI 1) performs approximate inference in polynomial time without changing the semantics of probabilistic logics; 2) is trained using data generated by the background knowledge; 3) can generate symbolic explanations of predictions; and 4) can guarantee the satisfaction of logical constraints at test time, which is vital in safety-critical applications. Our experiments show that A-NeSI is the first end-to-end method to solve three neurosymbolic tasks with exponential combinatorial scaling. Finally, our experiments show that A-NeSI achieves explainability and safety without a penalty in performance.

LGMar 18, 2022
Alleviating Adversarial Attacks on Variational Autoencoders with MCMC

Anna Kuzina, Max Welling, Jakub M. Tomczak

Variational autoencoders (VAEs) are latent variable models that can generate complex objects and provide meaningful latent representations. Moreover, they could be further used in downstream tasks such as classification. As previous work has shown, one can easily fool VAEs to produce unexpected latent representations and reconstructions for a visually slightly modified input. Here, we examine several objective functions for adversarial attack construction proposed previously and present a solution to alleviate the effect of these attacks. Our method utilizes the Markov Chain Monte Carlo (MCMC) technique in the inference step that we motivate with a theoretical analysis. Thus, we do not incorporate any extra costs during training, and the performance on non-attacked inputs is not decreased. We validate our approach on a variety of datasets (MNIST, Fashion MNIST, Color MNIST, CelebA) and VAE configurations ($β$-VAE, NVAE, $β$-TCVAE), and show that our approach consistently improves the model robustness to adversarial attacks.

CVJan 25, 2023
Modelling Long Range Dependencies in $N$D: From Task-Specific to a General Purpose CNN

David M. Knigge, David W. Romero, Albert Gu et al.

Performant Convolutional Neural Network (CNN) architectures must be tailored to specific tasks in order to consider the length, resolution, and dimensionality of the input data. In this work, we tackle the need for problem-specific CNN architectures. We present the Continuous Convolutional Neural Network (CCNN): a single CNN able to process data of arbitrary resolution, dimensionality and length without any structural changes. Its key component are its continuous convolutional kernels which model long-range dependencies at every layer, and thus remove the need of current CNN architectures for task-dependent downsampling and depths. We showcase the generality of our method by using the same architecture for tasks on sequential ($1{\rm D}$), visual ($2{\rm D}$) and point-cloud ($3{\rm D}$) data. Our CCNN matches and often outperforms the current state-of-the-art across all tasks considered.

LGJun 7, 2022
Towards a General Purpose CNN for Long Range Dependencies in $N$D

David W. Romero, David M. Knigge, Albert Gu et al.

The use of Convolutional Neural Networks (CNNs) is widespread in Deep Learning due to a range of desirable model properties which result in an efficient and effective machine learning framework. However, performant CNN architectures must be tailored to specific tasks in order to incorporate considerations such as the input length, resolution, and dimentionality. In this work, we overcome the need for problem-specific CNN architectures with our Continuous Convolutional Neural Network (CCNN): a single CNN architecture equipped with continuous convolutional kernels that can be used for tasks on data of arbitrary resolution, dimensionality and length without structural changes. Continuous convolutional kernels model long range dependencies at every layer, and remove the need for downsampling layers and task-dependent depths needed in current CNN architectures. We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential (1$\mathrm{D}$) and visual data (2$\mathrm{D}$). Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.

LGMay 31, 2022
On Analyzing Generative and Denoising Capabilities of Diffusion-based Deep Generative Models

Kamil Deja, Anna Kuzina, Tomasz Trzciński et al.

Diffusion-based Deep Generative Models (DDGMs) offer state-of-the-art performance in generative modeling. Their main strength comes from their unique setup in which a model (the backward diffusion process) is trained to reverse the forward diffusion process, which gradually adds noise to the input signal. Although DDGMs are well studied, it is still unclear how the small amount of noise is transformed during the backward diffusion process. Here, we focus on analyzing this problem to gain more insight into the behavior of DDGMs and their denoising and generative capabilities. We observe a fluid transition point that changes the functionality of the backward diffusion process from generating a (corrupted) image from noise to denoising the corrupted image to the final sample. Based on this observation, we postulate to divide a DDGM into two parts: a denoiser and a generator. The denoiser could be parameterized by a denoising auto-encoder, while the generator is a diffusion-based model with its own set of parameters. We experimentally validate our proposition, showing its pros and cons.

LGJan 31, 2023
Learning Data Representations with Joint Diffusion Models

Kamil Deja, Tomasz Trzcinski, Jakub M. Tomczak

Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models not only for generating but also predicting. We then propose to extend the vanilla diffusion model with a classifier that allows for stable joint end-to-end training with shared parameterization between those objectives. The resulting joint diffusion model outperforms recent state-of-the-art hybrid methods in terms of both classification and generation quality on all evaluated benchmarks. On top of our joint training approach, we present how we can directly benefit from shared generative and discriminative representations by introducing a method for visual counterfactual explanations.

LGMar 27, 2023
Exploring Continual Learning of Diffusion Models

Michał Zając, Kamil Deja, Anna Kuzina et al.

Diffusion models have achieved remarkable success in generating high-quality images thanks to their novel training procedures applied to unprecedented amounts of data. However, training a diffusion model from scratch is computationally expensive. This highlights the need to investigate the possibility of training these models iteratively, reusing computation while the data distribution changes. In this study, we take the first step in this direction and evaluate the continual learning (CL) properties of diffusion models. We begin by benchmarking the most common CL methods applied to Denoising Diffusion Probabilistic Models (DDPMs), where we note the strong performance of the experience replay with the reduced rehearsal coefficient. Furthermore, we provide insights into the dynamics of forgetting, which exhibit diverse behavior across diffusion timesteps. We also uncover certain pitfalls of using the bits-per-dimension metric for evaluating CL.

LGNov 4, 2023
Mixed Models with Multiple Instance Learning

Jan P. Engelmann, Alessandro Palma, Jakub M. Tomczak et al.

Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.

LGOct 3, 2023
De Novo Drug Design with Joint Transformers

Adam Izdebski, Ewelina Weglarz-Tomczak, Ewa Szczurek et al.

De novo drug design requires simultaneously generating novel molecules outside of training data and predicting their target properties, making it a hard task for generative models. To address this, we propose Joint Transformer that combines a Transformer decoder, Transformer encoder, and a predictor in a joint generative model with shared weights. We formulate a probabilistic black-box optimization algorithm that employs Joint Transformer to generate novel molecules with improved target properties and outperforms other SMILES-based optimization methods in de novo drug design.

LGFeb 20, 2023
Discouraging posterior collapse in hierarchical Variational Autoencoders using context

Anna Kuzina, Jakub M. Tomczak

Hierarchical Variational Autoencoders (VAEs) are among the most popular likelihood-based generative models. There is a consensus that the top-down hierarchical VAEs allow effective learning of deep latent structures and avoid problems like posterior collapse. Here, we show that this is not necessarily the case, and the problem of collapsing posteriors remains. To discourage this issue, we propose a deep hierarchical VAE with a context on top. Specifically, we use a Discrete Cosine Transform to obtain the last latent variable. In a series of experiments, we observe that the proposed modification allows us to achieve better utilization of the latent space and does not harm the model's generative abilities.

MLNov 4, 2025
Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models

Giovanni Palla, Sudarshan Babu, Payam Dibaeinia et al.

Computational modeling of single-cell gene expression is crucial for understanding cellular processes, but generating realistic expression profiles remains a major challenge. This difficulty arises from the count nature of gene expression data and complex latent dependencies among genes. Existing generative models often impose artificial gene orderings or rely on shallow neural network architectures. We introduce a scalable latent diffusion model for single-cell gene expression data, which we refer to as scLDM, that respects the fundamental exchangeability property of the data. Our VAE uses fixed-size latent variables leveraging a unified Multi-head Cross-Attention Block (MCAB) architecture, which serves dual roles: permutation-invariant pooling in the encoder and permutation-equivariant unpooling in the decoder. We enhance this framework by replacing the Gaussian prior with a latent diffusion model using Diffusion Transformers and linear interpolants, enabling high-quality generation with multi-conditional classifier-free guidance. We show its superior performance in a variety of experiments for both observational and perturbational single-cell data, as well as downstream tasks like cell-level classification.

MLApr 1, 2021Code
Storchastic: A Framework for General Stochastic Automatic Differentiation

Emile van Krieken, Jakub M. Tomczak, Annette ten Teije

Modelers use automatic differentiation (AD) of computation graphs to implement complex Deep Learning models without defining gradient computations. Stochastic AD extends AD to stochastic computation graphs with sampling steps, which arise when modelers handle the intractable expectations common in Reinforcement Learning and Variational Inference. However, current methods for stochastic AD are limited: They are either only applicable to continuous random variables and differentiable functions, or can only use simple but high variance score-function estimators. To overcome these limitations, we introduce Storchastic, a new framework for AD of stochastic computation graphs. Storchastic allows the modeler to choose from a wide variety of gradient estimation methods at each sampling step, to optimally reduce the variance of the gradient estimates. Furthermore, Storchastic is provably unbiased for estimation of any-order gradients, and generalizes variance reduction techniques to higher-order gradient estimates. Finally, we implement Storchastic as a PyTorch library at https://github.com/HEmile/storchastic.

LGJun 9, 2020Code
Wavelet Networks: Scale-Translation Equivariant Learning From Raw Time-Series

David W. Romero, Erik J. Bekkers, Jakub M. Tomczak et al.

Leveraging the symmetries inherent to specific data domains for the construction of equivariant neural networks has lead to remarkable improvements in terms of data efficiency and generalization. However, most existing research focuses on symmetries arising from planar and volumetric data, leaving a crucial data source largely underexplored: time-series. In this work, we fill this gap by leveraging the symmetries inherent to time-series for the construction of equivariant neural network. We identify two core symmetries: *scale and translation*, and construct scale-translation equivariant neural networks for time-series learning. Intriguingly, we find that scale-translation equivariant mappings share strong resemblance with the wavelet transform. Inspired by this resemblance, we term our networks Wavelet Networks, and show that they perform nested non-linear wavelet-like time-frequency transforms. Empirical results show that Wavelet Networks outperform conventional CNNs on raw waveforms, and match strongly engineered spectrogram techniques across several tasks and time-series types, including audio, environmental sounds, and electrical signals. Our code is publicly available at https://github.com/dwromero/wavelet_networks.

MLApr 3, 2018Code
Hyperspherical Variational Auto-Encoders

Tim R. Davidson, Luca Falorsi, Nicola De Cao et al.

The Variational Auto-Encoder (VAE) is one of the most used unsupervised machine learning models. But although the default choice of a Gaussian distribution for both the prior and posterior represents a mathematically convenient distribution often leading to competitive results, we show that this parameterization fails to model data with a latent hyperspherical structure. To address this issue we propose using a von Mises-Fisher (vMF) distribution instead, leading to a hyperspherical latent space. Through a series of experiments we show how such a hyperspherical VAE, or $\mathcal{S}$-VAE, is more suitable for capturing data with a hyperspherical latent structure, while outperforming a normal, $\mathcal{N}$-VAE, in low dimensions on other data types. Code at http://github.com/nicola-decao/s-vae-tf and https://github.com/nicola-decao/s-vae-pytorch

LGDec 2, 2024
Hierarchical VAE with a Diffusion-based VampPrior

Anna Kuzina, Jakub M. Tomczak

Deep hierarchical variational autoencoders (VAEs) are powerful latent variable generative models. In this paper, we introduce Hierarchical VAE with Diffusion-based Variational Mixture of the Posterior Prior (VampPrior). We apply amortization to scale the VampPrior to models with many stochastic layers. The proposed approach allows us to achieve better performance compared to the original VampPrior work and other deep hierarchical VAEs, while using fewer parameters. We empirically validate our method on standard benchmark datasets (MNIST, OMNIGLOT, CIFAR10) and demonstrate improved training stability and latent space utilization.

LGApr 11, 2025
Knowledge Graph-extended Retrieval Augmented Generation for Question Answering

Jasper Linders, Jakub M. Tomczak

Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts as well as easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of a Retrieval Augmented Generation (RAG) with a KG, thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG's potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.

LGApr 23, 2025
Synergistic Benefits of Joint Molecule Generation and Property Prediction

Adam Izdebski, Jan Olszewski, Pankhil Gawade et al.

Modeling the joint distribution of data samples and their properties allows to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial~peptides.

CLJun 25, 2024
Generative AI Systems: A Systems-based Perspective on Generative AI

Jakub M. Tomczak

Large Language Models (LLMs) have revolutionized AI systems by enabling communication with machines using natural language. Recent developments in Generative AI (GenAI) like Vision-Language Models (GPT-4V) and Gemini have shown great promise in using LLMs as multimodal systems. This new research line results in building Generative AI systems, GenAISys for short, that are capable of multimodal processing and content creation, as well as decision-making. GenAISys use natural language as a communication means and modality encoders as I/O interfaces for processing various data sources. They are also equipped with databases and external specialized tools, communicating with the system through a module for information retrieval and storage. This paper aims to explore and state new research directions in Generative AI Systems, including how to design GenAISys (compositionality, reliability, verifiability), build and train them, and what can be learned from the system-based perspective. Cross-disciplinary approaches are needed to answer open questions about the inner workings of GenAI systems.

RONov 18, 2021
The Effects of Learning in Morphologically Evolving Robot Systems

Jie Luo, Aart Stuurman, Jakub M. Tomczak et al.

Simultaneously evolving morphologies (bodies) and controllers (brains) of robots can cause a mismatch between the inherited body and brain in the offspring. To mitigate this problem, the addition of an infant learning period by the so-called Triangle of Life framework has been proposed relatively long ago. However, an empirical assessment is still lacking to-date. In this paper we investigate the effects of such a learning mechanism from different perspectives. Using extensive simulations we show that learning can greatly increase task performance and reduce the number of generations required to reach a certain fitness level compared to the purely evolutionary approach. Furthermore, although learning only directly affects the controllers, we demonstrate that the evolved morphologies will be also different. This provides a quantitative demonstration that changes in the brain can induce changes in the body. Finally, we examine the concept of morphological intelligence quantified by the ability of a given body to learn. We observe that the learning delta, the performance difference between the inherited and the learned brain, is growing throughout the evolutionary process. This shows that evolution is producing robots with an increasing plasticity, that is, consecutive generations are becoming better and better learners which in turn makes them better and better at the given task. All in all, our results demonstrate that the Triangle of Life is not only a concept of theoretical interest, but a system architecture with practical benefits.

CVOct 15, 2021
FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak et al.

When designing Convolutional Neural Networks (CNNs), one must select the size\break of the convolutional kernels before training. Recent works show CNNs benefit from different kernel sizes at different layers, but exploring all possible combinations is unfeasible in practice. A more efficient approach is to learn the kernel size during training. However, existing works that learn the kernel size have a limited bandwidth. These approaches scale kernels by dilation, and thus the detail they can describe is limited. In this work, we propose FlexConv, a novel convolutional operation with which high bandwidth convolutional kernels of learnable kernel size can be learned at a fixed parameter cost. FlexNets model long-term dependencies without the use of pooling, achieve state-of-the-art performance on several sequential datasets, outperform recent works with learned kernel sizes, and are competitive with much deeper ResNets on image benchmark datasets. Additionally, FlexNets can be deployed at higher resolutions than those seen during training. To avoid aliasing, we propose a novel kernel parameterization with which the frequency of the kernels can be analytically controlled. Our novel kernel parameterization shows higher descriptive power and faster convergence speed than existing parameterizations. This leads to important improvements in classification accuracy.

NESep 22, 2021
Training Deep Spiking Auto-encoders without Bursting or Dying Neurons through Regularization

Justus F. Hübotter, Pablo Lanillos, Jakub M. Tomczak

Spiking neural networks are a promising approach towards next-generation models of the brain in computational neuroscience. Moreover, compared to classic artificial neural networks, they could serve as an energy-efficient deployment of AI by enabling fast computation in specialized neuromorphic hardware. However, training deep spiking neural networks, especially in an unsupervised manner, is challenging and the performance of a spiking model is significantly hindered by dead or bursting neurons. Here, we apply end-to-end learning with membrane potential-based backpropagation to a spiking convolutional auto-encoder with multiple trainable layers of leaky integrate-and-fire neurons. We propose bio-inspired regularization methods to control the spike density in latent representations. In the experiments, we show that applying regularization on membrane potential and spiking output successfully avoids both dead and bursting neurons and significantly decreases the reconstruction error of the spiking auto-encoder. Training regularized networks on the MNIST dataset yields image reconstruction quality comparable to non-spiking baseline models (deterministic and variational auto-encoder) and indicates improvement upon earlier approaches. Importantly, we show that, unlike the variational auto-encoder, the spiking latent representations display structure associated with the image class.

CRMar 10, 2021
Diagnosing Vulnerability of Variational Auto-Encoders to Adversarial Attacks

Anna Kuzina, Max Welling, Jakub M. Tomczak

In this work, we explore adversarial attacks on the Variational Autoencoders (VAE). We show how to modify data point to obtain a prescribed latent code (supervised attack) or just get a drastically different code (unsupervised attack). We examine the influence of model modifications ($β$-VAE, NVAE) on the robustness of VAEs and suggest metrics to quantify it.

MLFeb 4, 2021
Invertible DenseNets with Concatenated LipSwish

Yura Perugachi-Diaz, Jakub M. Tomczak, Sandjai Bhulai

We introduce Invertible Dense Networks (i-DenseNets), a more parameter efficient extension of Residual Flows. The method relies on an analysis of the Lipschitz continuity of the concatenation in DenseNets, where we enforce invertibility of the network by satisfying the Lipschitz constant. Furthermore, we propose a learnable weighted concatenation, which not only improves the model performance but also indicates the importance of the concatenated weighted representation. Additionally, we introduce the Concatenated LipSwish as activation function, for which we show how to enforce the Lipschitz condition and which boosts performance. The new architecture, i-DenseNet, out-performs Residual Flow and other flow-based models on density estimation evaluated in bits per dimension, where we utilize an equal parameter budget. Moreover, we show that the proposed model out-performs Residual Flows when trained as a hybrid model where the model is both a generative and a discriminative model.

LGFeb 4, 2021
CKConv: Continuous Kernel Convolution For Sequential Data

David W. Romero, Anna Kuzina, Erik J. Bekkers et al.

Conventional neural architectures for sequential data present important limitations. Recurrent networks suffer from exploding and vanishing gradients, small effective memory horizons, and must be trained sequentially. Convolutional networks are unable to handle sequences of unknown size and their memory horizon must be defined a priori. In this work, we show that all these problems can be solved by formulating convolutional kernels in CNNs as continuous functions. The resulting Continuous Kernel Convolution (CKConv) allows us to model arbitrarily long sequences in a parallel manner, within a single operation, and without relying on any form of recurrence. We show that Continuous Kernel Convolutional Networks (CKCNNs) obtain state-of-the-art results in multiple datasets, e.g., permuted MNIST, and, thanks to their continuous nature, are able to handle non-uniformly sampled datasets and irregularly-sampled data natively. CKCNNs match or perform better than neural ODEs designed for these purposes in a faster and simpler manner.

LGNov 30, 2020
General Invertible Transformations for Flow-based Generative Modeling

Jakub M. Tomczak

In this paper, we present a new class of invertible transformations with an application to flow-based generative models. We indicate that many well-known invertible transformations in reversible logic and reversible neural networks could be derived from our proposition. Next, we propose two new coupling layers that are important building blocks of flow-based generative models. In the experiments on digit data, we present how these new coupling layers could be used in Integer Discrete Flows (IDF), and that they achieve better results than standard coupling layers used in IDF and RealNVP.

MLOct 19, 2020
ABC-Di: Approximate Bayesian Computation for Discrete Data

Ilze Amanda Auzina, Jakub M. Tomczak

Many real-life problems are represented as a black-box, i.e., the internal workings are inaccessible or a closed-form mathematical expression of the likelihood function cannot be defined. For continuous random variables likelihood-free inference problems can be solved by a group of methods under the name of Approximate Bayesian Computation (ABC). However, a similar approach for discrete random variables is yet to be formulated. Here, we aim to fill this research gap. We propose to use a population-based MCMC ABC framework. Further, we present a valid Markov kernel, and propose a new kernel that is inspired by Differential Evolution. We assess the proposed approach on a problem with the known likelihood function, namely, discovering the underlying diseases based on a QMR-DT Network, and three likelihood-free inference problems: (i) the QMR-DT Network with the unknown likelihood function, (ii) learning binary neural network, and (iii) Neural Architecture Search. The obtained results indicate the high potential of the proposed framework and the superiority of the new Markov kernel.

AIOct 19, 2020
Learning Locomotion Skills in Evolvable Robots

Gongjin Lan, Maarten van Hooft, Matteo De Carlo et al.

The challenge of robotic reproduction -- making of new robots by recombining two existing ones -- has been recently cracked and physically evolving robot systems have come within reach. Here we address the next big hurdle: producing an adequate brain for a newborn robot. In particular, we address the task of targeted locomotion which is arguably a fundamental skill in any practical implementation. We introduce a controller architecture and a generic learning method to allow a modular robot with an arbitrary shape to learn to walk towards a target and follow this target if it moves. Our approach is validated on three robots, a spider, a gecko, and their offspring, in three real-world scenarios.

LGOct 5, 2020
Invertible DenseNets

Yura Perugachi-Diaz, Jakub M. Tomczak, Sandjai Bhulai

We introduce Invertible Dense Networks (i-DenseNets), a more parameter efficient alternative to Residual Flows. The method relies on an analysis of the Lipschitz continuity of the concatenation in DenseNets, where we enforce the invertibility of the network by satisfying the Lipschitz constraint. Additionally, we extend this method by proposing a learnable concatenation, which not only improves the model performance but also indicates the importance of the concatenated representation. We demonstrate the performance of i-DenseNets and Residual Flows on toy, MNIST, and CIFAR10 data. Both i-DenseNets outperform Residual Flows evaluated in negative log-likelihood, on all considered datasets under an equal parameter budget.

MLOct 5, 2020
Self-Supervised Variational Auto-Encoders

Ioannis Gatopoulos, Jakub M. Tomczak

Density estimation, compression and data generation are crucial tasks in artificial intelligence. Variational Auto-Encoders (VAEs) constitute a single framework to achieve these goals. Here, we present a novel class of generative models, called self-supervised Variational Auto-Encoder (selfVAE), that utilizes deterministic and discrete variational posteriors. This class of models allows to perform both conditional and unconditional sampling, while simplifying the objective function. First, we use a single self-supervised transformation as a latent variable, where a transformation is either downscaling or edge detection. Next, we consider a hierarchical architecture, i.e., multiple transformations, and we show its benefits compared to the VAE. The flexibility of selfVAE in data reconstruction finds a particularly interesting use case in data compression tasks, where we can trade-off memory for better data quality, and vice-versa. We present performance of our approach on three benchmark image data (Cifar10, Imagenette64, and CelebA).

BMSep 19, 2020
Population-based Optimization for Kinetic Parameter Identification in Glycolytic Pathway in Saccharomyces cerevisiae

Ewelina Weglarz-Tomczak, Jakub M. Tomczak, Agoston E. Eiben et al.

Models in systems biology are mathematical descriptions of biological processes that are used to answer questions and gain a better understanding of biological phenomena. Dynamic models represent the network through rates of the production and consumption for the individual species. The ordinary differential equations that describe rates of the reactions in the model include a set of parameters. The parameters are important quantities to understand and analyze biological systems. Moreover, the perturbation of the kinetic parameters are correlated with upregulation of the system by cell-intrinsic and cell-extrinsic factors, including mutations and the environment changes. Here, we aim at using well-established models of biological pathways to identify parameter values and point their potential perturbation/deviation. We present our population-based optimization framework that is able to identify kinetic parameters in the dynamic model based on only input and output data (i.e., timecourses of selected metabolites). Our approach can deal with the identification of the non-measurable parameters as well as with discovering deviation of the parameters. We present our proposed optimization framework on the example of the well-studied glycolytic pathway in Saccharomyces cerevisiae.

LGJun 9, 2020
Super-resolution Variational Auto-Encoders

Ioannis Gatopoulos, Maarten Stol, Jakub M. Tomczak

The framework of variational autoencoders (VAEs) provides a principled method for jointly learning latent-variable models and corresponding inference models. However, the main drawback of this approach is the blurriness of the generated images. Some studies link this effect to the objective function, namely, the (negative) log-likelihood. Here, we propose to enhance VAEs by adding a random variable that is a downscaled version of the original image and still use the log-likelihood function as the learning objective. Further, by providing the downscaled image as an input to the decoder, it can be used in a manner similar to the super-resolution. We present empirically that the proposed approach performs comparably to VAEs in terms of the negative log-likelihood, but it obtains a better FID score in data synthesis.

LGJun 2, 2020
The Convolution Exponential and Generalized Sylvester Flows

Emiel Hoogeboom, Victor Garcia Satorras, Jakub M. Tomczak et al.

This paper introduces a new method to build linear flows, by taking the exponential of a linear transformation. This linear transformation does not need to be invertible itself, and the exponential has the following desirable properties: it is guaranteed to be invertible, its inverse is straightforward to compute and the log Jacobian determinant is equal to the trace of the linear transformation. An important insight is that the exponential can be computed implicitly, which allows the use of convolutional layers. Using this insight, we develop new invertible transformations named convolution exponentials and graph convolution exponentials, which retain the equivariance of their underlying transformations. In addition, we generalize Sylvester Flows and propose Convolutional Sylvester Flows which are based on the generalization and the convolution exponential as basis change. Empirically, we show that the convolution exponential outperforms other linear transformations in generative flows on CIFAR10 and the graph convolution exponential improves the performance of graph normalizing flows. In addition, we show that Convolutional Sylvester Flows improve performance over residual flows as a generative flow model measured in log-likelihood.

MLMay 4, 2020
Selecting Data Augmentation for Simulating Interventions

Maximilian Ilse, Jakub M. Tomczak, Patrick Forré

Machine learning models trained with purely observational data and the principle of empirical risk minimization \citep{vapnik_principles_1992} can fail to generalize to unseen domains. In this paper, we focus on the case where the problem arises through spurious correlation between the observed domains and the actual task labels. We find that many domain generalization methods do not explicitly take this spurious correlation into account. Instead, especially in more application-oriented research areas like medical imaging or robotics, data augmentation techniques that are based on heuristics are used to learn domain invariant features. To bridge the gap between theory and practice, we develop a causal perspective on the problem of domain generalization. We argue that causal concepts can be used to explain the success of data augmentation by describing how they can weaken the spurious correlation between the observed domains and the task labels. We demonstrate that data augmentation can serve as a tool for simulating interventional data. We use these theoretical insights to derive a simple algorithm that is able to select data augmentation techniques that will lead to better domain generalization.

NEMay 4, 2020
Time Efficiency in Optimization with a Bayesian-Evolutionary Algorithm

Gongjin Lan, Jakub M. Tomczak, Diederik M. Roijers et al.

Not all generate-and-test search algorithms are created equal. Bayesian Optimization (BO) invests a lot of computation time to generate the candidate solution that best balances the predicted value and the uncertainty given all previous data, taking increasingly more time as the number of evaluations performed grows. Evolutionary Algorithms (EA) on the other hand rely on search heuristics that typically do not depend on all previous data and can be done in constant time. Both the BO and EA community typically assess their performance as a function of the number of evaluations. However, this is unfair once we start to compare the efficiency of these classes of algorithms, as the overhead times to generate candidate solutions are significantly different. We suggest to measure the efficiency of generate-and-test search algorithms as the expected gain in the objective value per unit of computation time spent. We observe that the preference of an algorithm to be used can change after a number of function evaluations. We therefore propose a new algorithm, a combination of Bayesian optimization and an Evolutionary Algorithm, BEA for short, that starts with BO, then transfers knowledge to an EA, and subsequently runs the EA. We compare the BEA with BO and the EA. The results show that BEA outperforms both BO and the EA in terms of time efficiency, and ultimately leads to better performance on well-known benchmark objective functions with many local optima. Moreover, we test the three algorithms on nine test cases of robot learning problems and here again we find that BEA outperforms the other algorithms.

NEFeb 7, 2020
Differential Evolution with Reversible Linear Transformations

Jakub M. Tomczak, Ewelina Weglarz-Tomczak, Agoston E. Eiben

Differential evolution (DE) is a well-known type of evolutionary algorithms (EA). Similarly to other EA variants it can suffer from small populations and loose diversity too quickly. This paper presents a new approach to mitigate this issue: We propose to generate new candidate solutions by utilizing reversible linear transformation applied to a triplet of solutions from the population. In other words, the population is enlarged by using newly generated individuals without evaluating their fitness. We assess our methods on three problems: (i) benchmark function optimization, (ii) discovering parameter values of the gene repressilator system, (iii) learning neural networks. The empirical results indicate that the proposed approach outperforms vanilla DE and a version of DE with applying differential mutation three times on all testbeds.

CVFeb 7, 2020
Attentive Group Equivariant Convolutional Networks

David W. Romero, Erik J. Bekkers, Jakub M. Tomczak et al.

Although group convolutional networks are able to learn powerful representations based on symmetry patterns, they lack explicit means to learn meaningful relationships among them (e.g., relative positions and poses). In this paper, we present attentive group equivariant convolutions, a generalization of the group convolution, in which attention is applied during the course of convolution to accentuate meaningful symmetry combinations and suppress non-plausible, misleading ones. We indicate that prior work on visual attention can be described as special cases of our proposed framework and show empirically that our attentive group equivariant convolutional networks consistently outperform conventional group convolutional networks on benchmark image datasets. Simultaneously, we provide interpretability to the learned concepts through the visualization of equivariant attention maps.

LGJan 30, 2020
Learning Discrete Distributions by Dequantization

Emiel Hoogeboom, Taco S. Cohen, Jakub M. Tomczak

Media is generally stored digitally and is therefore discrete. Many successful deep distribution models in deep learning learn a density, i.e., the distribution of a continuous random variable. Naïve optimization on discrete data leads to arbitrarily high likelihoods, and instead, it has become standard practice to add noise to datapoints. In this paper, we present a general framework for dequantization that captures existing methods as a special case. We derive two new dequantization objectives: importance-weighted (iw) dequantization and Rényi dequantization. In addition, we introduce autoregressive dequantization (ARD) for more flexible dequantization distributions. Empirically we find that iw and Rényi dequantization considerably improve performance for uniform dequantization distributions. ARD achieves a negative log-likelihood of 3.06 bits per dimension on CIFAR10, which to the best of our knowledge is state-of-the-art among distribution models that do not require autoregressive inverses for sampling.

NEJan 21, 2020
Learning Directed Locomotion in Modular Robots with Evolvable Morphologies

Gongjin Lan, Matteo De Carlo, Fuda van Diggelen et al.

We generalize the well-studied problem of gait learning in modular robots in two dimensions. Firstly, we address locomotion in a given target direction that goes beyond learning a typical undirected gait. Secondly, rather than studying one fixed robot morphology we consider a test suite of different modular robots. This study is based on our interest in evolutionary robot systems where both morphologies and controllers evolve. In such a system, newborn robots have to learn to control their own body that is a random combination of the bodies of the parents. We apply and compare two learning algorithms, Bayesian optimization and HyperNEAT. The results of the experiments in simulation show that both methods successfully learn good controllers, but Bayesian optimization is more effective and efficient. We validate the best learned controllers by constructing three robots from the test suite in the real world and observe their fitness and actual trajectories. The obtained results indicate a reality gap that depends on the controllers and the shape of the robots, but overall the trajectories are adequate and follow the target directions successfully.

MLOct 7, 2019
Increasing Expressivity of a Hyperspherical VAE

Tim R. Davidson, Jakub M. Tomczak, Efstratios Gavves

Learning suitable latent representations for observed, high-dimensional data is an important research topic underlying many recent advances in machine learning. While traditionally the Gaussian normal distribution has been the go-to latent parameterization, recently a variety of works have successfully proposed the use of manifold-valued latents. In one such work (Davidson et al., 2018), the authors empirically show the potential benefits of using a hyperspherical von Mises-Fisher (vMF) distribution in low dimensionality. However, due to the unique distributional form of the vMF, expressivity in higher dimensional space is limited as a result of its scalar concentration parameter leading to a 'hyperspherical bottleneck'. In this work we propose to extend the usability of hyperspherical parameterizations to higher dimensions using a product-space instead, showing improved results on a selection of image datasets.

IVAug 14, 2019
Video Compression With Rate-Distortion Autoencoders

Amirhossein Habibian, Ties van Rozendaal, Jakub M. Tomczak et al.

In this paper we present a a deep generative model for lossy video compression. We employ a model that consists of a 3D autoencoder with a discrete latent space and an autoregressive prior used for entropy coding. Both autoencoder and prior are trained jointly to minimize a rate-distortion loss, which is closely related to the ELBO used in variational autoencoders. Despite its simplicity, we find that our method outperforms the state-of-the-art learned video compression networks based on motion compensation or interpolation. We systematically evaluate various design choices, such as the use of frame-based or spatio-temporal autoencoders, and the type of autoregressive prior. In addition, we present three extensions of the basic method that demonstrate the benefits over classical approaches to compression. First, we introduce semantic compression, where the model is trained to allocate more bits to objects of interest. Second, we study adaptive compression, where the model is adapted to a domain with limited variability, e.g., videos taken from an autonomous car, to achieve superior compression on that domain. Finally, we introduce multimodal compression, where we demonstrate the effectiveness of our model in joint compression of multiple modalities captured by non-standard imaging sensors, such as quad cameras. We believe that this opens up novel video compression applications, which have not been feasible with classical codecs.

MLMay 24, 2019
DIVA: Domain Invariant Variational Autoencoders

Maximilian Ilse, Jakub M. Tomczak, Christos Louizos et al.

We consider the problem of domain generalization, namely, how to learn representations given data from a set of domains that generalize to data from a previously unseen domain. We propose the Domain Invariant Variational Autoencoder (DIVA), a generative model that tackles this problem by learning three independent latent subspaces, one for the domain, one for the class, and one for any residual variations. We highlight that due to the generative nature of our model we can also incorporate unlabeled data from known or previously unseen domains. To the best of our knowledge this has not been done before in a domain generalization setting. This property is highly desirable in fields like medical imaging where labeled data is scarce. We experimentally evaluate our model on the rotated MNIST benchmark and a malaria cell images dataset where we show that (i) the learned subspaces are indeed complementary to each other, (ii) we improve upon recent works on this task and (iii) incorporating unlabelled data can boost the performance even further.

MLApr 26, 2019
Simulating Execution Time of Tensor Programs using Graph Neural Networks

Jakub M. Tomczak, Romain Lepert, Auke Wiggers

Optimizing the execution time of tensor program, e.g., a convolution, involves finding its optimal configuration. Searching the configuration space exhaustively is typically infeasible in practice. In line with recent research using TVM, we propose to learn a surrogate model to overcome this issue. The model is trained on an acyclic graph called an abstract syntax tree, and utilizes a graph convolutional network to exploit structure in the graph. We claim that a learnable graph-based data processing is a strong competitor to heuristic-based feature extraction. We present a new dataset of graphs corresponding to configurations and their execution time for various tensor programs. We provide baselines for a runtime prediction task.

MLFeb 1, 2019
Combinatorial Bayesian Optimization using the Graph Cartesian Product

Changyong Oh, Jakub M. Tomczak, Efstratios Gavves et al.

This paper focuses on Bayesian Optimization (BO) for objectives on combinatorial search spaces, including ordinal and categorical variables. Despite the abundance of potential applications of Combinatorial BO, including chipset configuration search and neural architecture search, only a handful of methods have been proposed. We introduce COMBO, a new Gaussian Process (GP) BO. COMBO quantifies "smoothness" of functions on combinatorial search spaces by utilizing a combinatorial graph. The vertex set of the combinatorial graph consists of all possible joint assignments of the variables, while edges are constructed using the graph Cartesian product of the sub-graphs that represent the individual variables. On this combinatorial graph, we propose an ARD diffusion kernel with which the GP is able to model high-order interactions between variables leading to better performance. Moreover, using the Horseshoe prior for the scale parameter in the ARD diffusion kernel results in an effective variable selection procedure, making COMBO suitable for high dimensional problems. Computationally, in COMBO the graph Cartesian product allows the Graph Fourier Transform calculation to scale linearly instead of exponentially. We validate COMBO in a wide array of realistic benchmarks, including weighted maximum satisfiability problems and neural architecture search. COMBO outperforms consistently the latest state-of-the-art while maintaining computational and statistical efficiency.

MLJun 26, 2018
Hierarchical VampPrior Variational Fair Auto-Encoder

Philip Botros, Jakub M. Tomczak

Decision making is a process that is extremely prone to different biases. In this paper we consider learning fair representations that aim at removing nuisance (sensitive) information from the decision process. For this purpose, we propose to use deep generative modeling and adapt a hierarchical Variational Auto-Encoder to learn these fair representations. Moreover, we utilize the mutual information as a useful regularizer for enforcing fairness of a representation. In experiments on two benchmark datasets and two scenarios where the sensitive variables are fully and partially observable, we show that the proposed approach either outperforms or performs on par with the current best model.

MLMar 15, 2018
Sylvester Normalizing Flows for Variational Inference

Rianne van den Berg, Leonard Hasenclever, Jakub M. Tomczak et al.

Variational inference relies on flexible approximate posterior distributions. Normalizing flows provide a general recipe to construct flexible variational posteriors. We introduce Sylvester normalizing flows, which can be seen as a generalization of planar flows. Sylvester normalizing flows remove the well-known single-unit bottleneck from planar flows, making a single transformation much more flexible. We compare the performance of Sylvester normalizing flows against planar flows and inverse autoregressive flows and demonstrate that they compare favorably on several datasets.

LGFeb 13, 2018
Attention-based Deep Multiple Instance Learning

Maximilian Ilse, Jakub M. Tomczak, Max Welling

Multiple instance learning (MIL) is a variation of supervised learning where a single class label is assigned to a bag of instances. In this paper, we state the MIL problem as learning the Bernoulli distribution of the bag label where the bag label probability is fully parameterized by neural networks. Furthermore, we propose a neural network-based permutation-invariant aggregation operator that corresponds to the attention mechanism. Notably, an application of the proposed attention-based operator provides insight into the contribution of each instance to the bag label. We show empirically that our approach achieves comparable performance to the best MIL methods on benchmark MIL datasets and it outperforms other methods on a MNIST-based MIL dataset and two real-life histopathology datasets without sacrificing interpretability.

LGDec 1, 2017
Deep Learning with Permutation-invariant Operator for Multi-instance Histopathology Classification

Jakub M. Tomczak, Maximilian Ilse, Max Welling

The computer-aided analysis of medical scans is a longstanding goal in the medical imaging field. Currently, deep learning has became a dominant methodology for supporting pathologists and radiologist. Deep learning algorithms have been successfully applied to digital pathology and radiology, nevertheless, there are still practical issues that prevent these tools to be widely used in practice. The main obstacles are low number of available cases and large size of images (a.k.a. the small n, large p problem in machine learning), and a very limited access to annotation at a pixel level that can lead to severe overfitting and large computational requirements. We propose to handle these issues by introducing a framework that processes a medical image as a collection of small patches using a single, shared neural network. The final diagnosis is provided by combining scores of individual patches using a permutation-invariant operator (combination). In machine learning community such approach is called a multi-instance learning (MIL).

MLJun 7, 2017
Improving Variational Auto-Encoders using convex combination linear Inverse Autoregressive Flow

Jakub M. Tomczak, Max Welling

In this paper, we propose a new volume-preserving flow and show that it performs similarly to the linear general normalizing flow. The idea is to enrich a linear Inverse Autoregressive Flow by introducing multiple lower-triangular matrices with ones on the diagonal and combining them using a convex combination. In the experimental studies on MNIST and Histopathology data we show that the proposed approach outperforms other volume-preserving flows and is competitive with current state-of-the-art linear normalizing flow.

LGMay 19, 2017
VAE with a VampPrior

Jakub M. Tomczak, Max Welling

Many different methods to train deep generative models have been introduced in the past. In this paper, we propose to extend the variational auto-encoder (VAE) framework with a new type of prior which we call "Variational Mixture of Posteriors" prior, or VampPrior for short. The VampPrior consists of a mixture distribution (e.g., a mixture of Gaussians) with components given by variational posteriors conditioned on learnable pseudo-inputs. We further extend this prior to a two layer hierarchical model and show that this architecture with a coupled prior and posterior, learns significantly better models. The model also avoids the usual local optima issues related to useless latent dimensions that plague VAEs. We provide empirical studies on six datasets, namely, static and binary MNIST, OMNIGLOT, Caltech 101 Silhouettes, Frey Faces and Histopathology patches, and show that applying the hierarchical VampPrior delivers state-of-the-art results on all datasets in the unsupervised permutation invariant setting and the best results or comparable to SOTA methods for the approach with convolutional networks.

LGNov 29, 2016
Improving Variational Auto-Encoders using Householder Flow

Jakub M. Tomczak, Max Welling

Variational auto-encoders (VAE) are scalable and powerful generative models. However, the choice of the variational posterior determines tractability and flexibility of the VAE. Commonly, latent variables are modeled using the normal distribution with a diagonal covariance matrix. This results in computational efficiency but typically it is not flexible enough to match the true posterior distribution. One fashion of enriching the variational posterior distribution is application of normalizing flows, i.e., a series of invertible transformations to latent variables with a simple posterior. In this paper, we follow this line of thinking and propose a volume-preserving flow that uses a series of Householder transformations. We show empirically on MNIST dataset and histopathology data that the proposed flow allows to obtain more flexible variational posterior and competitive results comparing to other normalizing flows.