SD CV LG AS
May 22, 2023

AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

arXiv:2305.13050v1
29 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of extending image generation to audio conditioning, offering a lightweight solution for multimodal applications, though it is incremental as it builds on existing text-to-image models.

The paper tackles the problem of adapting text-conditioned diffusion models for audio-to-image generation by introducing a novel method that encodes audio into a token, requiring few trainable parameters and achieving superior results over baselines in objective and subjective metrics.

In recent years, image generation has shown a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This raises the question: "how can we adapt such models to be conditioned on other modalities?". In this paper, we propose a novel method utilizing latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods, considering objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
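The idea of encoding audio into a token that sits between the audio and text representations can be illustrated with a toy sketch. This is not the authors' implementation: the mean pooling, the single linear layer, the dimensions, and every name below (`AudioTokenAdapter`, `splice_audio_token`, the random stand-in features) are assumptions chosen for illustration. It only shows the general pattern, in which frozen audio features are projected into the text-embedding space and written into a placeholder position of the prompt, with the projection as the only trainable part.

```python
import random


def mean_pool(frames):
    """Average a list of per-frame feature vectors into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


class AudioTokenAdapter:
    """Hypothetical adapter: one linear layer from audio space to text space.

    In this sketch it is the only component with trainable parameters;
    the audio encoder and the diffusion model are assumed to stay frozen.
    """

    def __init__(self, audio_dim, text_dim, seed=0):
        rng = random.Random(seed)
        # weight has shape (text_dim, audio_dim); small random initial values
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(audio_dim)]
            for _ in range(text_dim)
        ]
        self.bias = [0.0] * text_dim

    def __call__(self, audio_vec):
        # Plain matrix-vector product plus bias
        return [
            sum(w * x for w, x in zip(row, audio_vec)) + b
            for row, b in zip(self.weight, self.bias)
        ]

    def num_parameters(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


def splice_audio_token(prompt_embeddings, placeholder_index, audio_token):
    """Return a copy of the prompt embeddings with one position replaced."""
    spliced = [list(vec) for vec in prompt_embeddings]
    spliced[placeholder_index] = list(audio_token)
    return spliced


# Toy sizes; real encoders and text embedders use far larger dimensions.
AUDIO_DIM, TEXT_DIM = 8, 4

# Stand-in for the output of a frozen, pre-trained audio encoder (5 frames).
rng = random.Random(1)
audio_frames = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(5)]

# Stand-in for the embedded prompt; the last position is the placeholder.
prompt_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]
placeholder_index = 2

adapter = AudioTokenAdapter(AUDIO_DIM, TEXT_DIM)
audio_token = adapter(mean_pool(audio_frames))
conditioned = splice_audio_token(prompt_embeddings, placeholder_index, audio_token)

print("trainable parameters:", adapter.num_parameters())
print("conditioning sequence length:", len(conditioned))
```

In a full pipeline, the spliced sequence would be passed through the frozen text encoder and diffusion model, and only the adapter weights would be updated during training. That is one way to read the abstract's claim that few trainable parameters are needed.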

Code Implementations: 2 repos
