Audio Texture Manipulation by Exemplar-Based Analogy
This work addresses the problem of precise audio editing for sound designers and researchers. The approach is novel, though incremental in that it applies analogy methods to audio.
The paper tackles audio texture manipulation by proposing an exemplar-based analogy model that learns transformations from paired speech examples. The model outperforms text-conditioned baselines in quantitative evaluations and perceptual studies.
Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the original sound and another illustrates the desired transformation. The model learns to apply the same transformation to new input, allowing for the manipulation of sound textures. We construct a quadruplet dataset representing various editing tasks, and train a latent diffusion model in a self-supervised manner. We show through quantitative evaluations and perceptual studies that our model outperforms text-conditioned baselines and generalizes to real-world, out-of-distribution, and non-speech scenarios. Project page: https://berkeley-speech-group.github.io/audio-texture-analogy/
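As a rough illustration of the quadruplet idea, the sketch below builds (exemplar before, exemplar after, query, target) tuples for the add, remove, and replace tasks. The abstract does not say how the quadruplets are constructed, so everything here is an assumption made for illustration: the additive mixing rule, the `gain` value, the synthetic tones standing in for speech and texture clips, and the names `Quadruplet`, `mix`, `make_quadruplet`, and `tone`.

```python
import math
from typing import List, NamedTuple, Optional

Clip = List[float]  # mono waveform as a plain list of samples


class Quadruplet(NamedTuple):
    exemplar_before: Clip  # original sound in the exemplar pair
    exemplar_after: Clip   # the same sound after the edit
    query: Clip            # new input to be edited
    target: Clip           # query after the same edit (training target)


def mix(base: Clip, texture: Clip, gain: float = 0.5) -> Clip:
    """Overlay a texture on a base clip (assumed additive mixing rule)."""
    if len(base) != len(texture):
        raise ValueError("clips must have the same length")
    return [s + gain * t for s, t in zip(base, texture)]


def make_quadruplet(task: str, speech_a: Clip, speech_b: Clip,
                    texture_x: Clip,
                    texture_y: Optional[Clip] = None) -> Quadruplet:
    """Apply one edit to two different clips so the pairs are analogous."""
    if task == "add":
        return Quadruplet(speech_a, mix(speech_a, texture_x),
                          speech_b, mix(speech_b, texture_x))
    if task == "remove":
        return Quadruplet(mix(speech_a, texture_x), speech_a,
                          mix(speech_b, texture_x), speech_b)
    if task == "replace":
        if texture_y is None:
            raise ValueError("replace needs a second texture")
        return Quadruplet(mix(speech_a, texture_x), mix(speech_a, texture_y),
                          mix(speech_b, texture_x), mix(speech_b, texture_y))
    raise ValueError(f"unknown task: {task}")


def tone(freq_hz: float, n: int = 160, sample_rate: int = 16000) -> Clip:
    """Synthetic sine tone used as a stand-in for a real recording."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# Toy usage: two "speech" clips and one "texture" clip.
clip_a, clip_b, texture = tone(220), tone(330), tone(1000)
quad = make_quadruplet("add", clip_a, clip_b, texture)
```

Because the target is derived from the same source clips as the query, no human labels are needed, which is one plausible reading of the self-supervised training described above. A model trained on such tuples would be conditioned on the exemplar pair and the query, and asked to produce the target.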