Daniel Wesego

LG
h-index: 1
4 papers
8 citations
Novelty: 51%
AI Score: 41

4 Papers

LG · Aug 29, 2024
Multimodal ELBO with Diffusion Decoders

Daniel Wesego, Pedram Rooshenas

Multimodal variational autoencoders (VAEs) have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, many multimodal VAE variants generate low-quality output, particularly when complex modalities such as images are involved. They also frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also integrate seamlessly with a standard feed-forward decoder for other types of modality, facilitating end-to-end training and inference. Furthermore, we introduce an auxiliary score-based model to enhance the unconditional generation capabilities of our proposed approach. This approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities to improve multimodal generation tasks. Our model provides state-of-the-art results compared to other multimodal VAEs across different datasets, with higher coherence and superior quality in the generated modalities.

LG · Jan 22, 2025 · Code
Graph Representation Learning with Diffusion Generative Models

Daniel Wesego

Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We extract the representation by combining the encoder's output with the decoder's hidden embedding at the first time step. Our approach demonstrates the potential of discrete diffusion models for graph representation learning. The code can be found at https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models

43.0 · LG · May 8
TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers

Daniel Wesego, Pedram Rooshenas

Adversarial purification with diffusion models seeks to project adversarial examples back toward the data manifold, but balancing semantic preservation and robustness against adaptive attacks remains challenging. Recent work shows that standard diffusion purification can fail under adaptive evaluation, while test-time score-based optimization is more resilient. Existing optimization defenses, however, typically rely on a single diffusion noise regime or treat timesteps uniformly, overlooking the distinct roles of coarse and fine denoising scales. We propose Temporal Adversarial Rectification Optimization (TARO), an inference-time purification method that builds a temporally guided score prior from multiple denoising views along the diffusion trajectory. TARO forms a coarse-to-fine residual target: high-noise experts provide globally smoothed structure with reduced adversarial sensitivity, while low-noise experts restore image-specific, class-relevant details. A guidance strength controls this temporal correction, allowing TARO to balance robust global rectification with semantic preservation. Empirically, TARO improves robust accuracy across datasets and adaptive threat models in a zero-shot setting, while remaining compatible with complementary adversarial-likelihood objectives for further robustness gains.

LGMay 25, 2023
Score-Based Multimodal Autoencoder

Daniel Wesego, Pedram Rooshenas

Multimodal Variational Autoencoders (VAEs) represent a promising group of generative models that facilitate the construction of a tractable posterior within the latent space given multiple modalities. Previous studies have shown that as the number of modalities increases, the generative quality of each modality declines. In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of independently trained unimodal VAEs using score-based models (SBMs). The role of the SBM is to enforce multimodal coherence by learning the correlation among the latent variables. Consequently, our model combines the better generative quality of unimodal VAEs with coherent integration across different modalities using the latent score-based model. In addition, our approach provides the best unconditional coherence.
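
As a rough illustration of how a score model over latents can tie modalities together, the toy sketch below holds one latent fixed and samples the other with Langevin dynamics. It is not the authors' implementation: the analytic score of a correlated bivariate Gaussian stands in for a learned SBM, each "modality" has a single scalar latent, and the names `joint_score_z2`, `sample_z2_given_z1`, and `conditional_mean` are hypothetical.

```python
import math
import random


def joint_score_z2(z1, z2, rho):
    """z2-component of the score (gradient of the log-density) of a zero-mean,
    unit-variance bivariate Gaussian with correlation rho.
    Assumption: this analytic score stands in for a learned latent SBM."""
    return (rho * z1 - z2) / (1.0 - rho ** 2)


def sample_z2_given_z1(z1, rho, steps=200, step_size=0.1, rng=random):
    """Unadjusted Langevin dynamics on z2 while the observed latent z1 is held fixed."""
    z2 = rng.gauss(0.0, 1.0)  # start from noise
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        drift = 0.5 * step_size * joint_score_z2(z1, z2, rho)
        z2 += drift + math.sqrt(step_size) * noise
    return z2


def conditional_mean(z1, rho, chains=500, seed=0):
    """Average the final state of several independent chains (seeded for repeatability)."""
    rng = random.Random(seed)
    samples = [sample_z2_given_z1(z1, rho, rng=rng) for _ in range(chains)]
    return sum(samples) / len(samples)
```

With `rho = 0.8` and `z1 = 1.0`, the sampled `z2` values should cluster around `rho * z1`, whereas with `rho = 0.0` they ignore `z1` entirely. In this toy setting, that dependence of one latent on the other is what "coherence" amounts to.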