TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument
This work addresses the need for flexible, intuitive sound design tools in audio generation, though the contribution appears incremental because it builds on existing neural audio codec and transformer methods.
The paper tackles audio generation from MIDI and text inputs by proposing TokenSynth, a neural synthesizer that uses a decoder-only transformer to produce audio tokens. The model supports instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without fine-tuning, and an objective evaluation of audio quality, timbral similarity, and synthesis accuracy suggests the approach has potential.
Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate the desired audio tokens from MIDI tokens and a CLAP (Contrastive Language-Audio Pretraining) embedding, which carries timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. Using objective measures, we evaluated the quality of the synthesized audio, the timbral similarity between the synthesized audio and the target audio or text, and synthesis accuracy (i.e., how accurately the output follows the input MIDI). TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth
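To make the conditioning scheme more concrete, the following is a minimal sketch of how an input sequence for a decoder-only transformer might be laid out from a timbre embedding slot, MIDI tokens, and audio codec tokens. It is not taken from the released code: the special-token ids, the vocabulary offsets, the helper names (`build_sequence`, `blend_timbre`), and the idea of blending an audio embedding with a text embedding for timbre manipulation are all assumptions made for illustration. The only property relied on is that CLAP places audio and text in a shared embedding space, which is what allows either modality to fill the timbre slot.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical token layout; these ids and offsets are NOT from the paper.
TIMBRE_SLOT = 0            # position whose embedding would be replaced by the projected CLAP vector
SEP = 1                    # boundary between the MIDI prefix and the audio tokens
MIDI_OFFSET = 2            # MIDI event ids are shifted into their own id range
MIDI_VOCAB = 512           # assumed number of distinct MIDI event ids
AUDIO_OFFSET = MIDI_OFFSET + MIDI_VOCAB  # audio codec ids start after the MIDI range


@dataclass
class Example:
    tokens: List[int]      # decoder input ids
    loss_mask: List[int]   # 1 where a next-token loss would be applied


def build_sequence(midi_events: Sequence[int], audio_codes: Sequence[int]) -> Example:
    """Concatenate [timbre slot, MIDI tokens, SEP, audio tokens] into one sequence."""
    for m in midi_events:
        if not 0 <= m < MIDI_VOCAB:
            raise ValueError(f"MIDI event id {m} outside assumed range 0..{MIDI_VOCAB - 1}")
    prefix = [TIMBRE_SLOT] + [MIDI_OFFSET + m for m in midi_events] + [SEP]
    target = [AUDIO_OFFSET + a for a in audio_codes]
    # Only the audio tokens are prediction targets; the prefix is conditioning.
    return Example(tokens=prefix + target,
                   loss_mask=[0] * len(prefix) + [1] * len(target))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def blend_timbre(audio_emb: Sequence[float], text_emb: Sequence[float],
                 weight: float) -> List[float]:
    """Assumed manipulation scheme: interpolate two embeddings, then renormalize.

    weight = 0.0 keeps the reference audio timbre; weight = 1.0 uses the text only.
    """
    if len(audio_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimensionality")
    mixed = [(1.0 - weight) * a + weight * t for a, t in zip(audio_emb, text_emb)]
    return l2_normalize(mixed)


if __name__ == "__main__":
    ex = build_sequence(midi_events=[60, 64], audio_codes=[5, 9, 3])
    print(ex.tokens)
    print(ex.loss_mask)
    print(blend_timbre([1.0, 0.0], [0.0, 1.0], weight=0.5))
```

Under these assumptions, instrument cloning would fill the timbre slot with an embedding of reference audio, text-to-instrument synthesis would fill it with an embedding of a text prompt, and text-guided manipulation would use some combination of the two. Masking the loss to the audio positions reflects that the MIDI and timbre inputs are conditioning rather than prediction targets.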