AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder
This work addresses the need for integrated speech processing models by offering a single model for analysis, control, and generation, though the contribution appears incremental because it builds on existing masked autoencoder techniques.
The paper tackles the problem of unifying speech analysis, control, and generation by introducing AnCoGen, a method based on a masked autoencoder that estimates attributes such as speaker identity and pitch and generates speech from them. Experiments show its effectiveness in tasks such as speech enhancement and pitch modification.
This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrate the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.
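To make the idea of one masked model serving analysis, control, and generation more concrete, the toy sketch below shows how the inputs to such a model might be prepared for each task. It is an illustration only, not the AnCoGen architecture: the names `build_input`, `control`, and `MASK`, the use of plain Python lists as an audio representation, and the assumption that analysis masks the attributes while generation masks the audio representation are all hypothetical choices made for this example. No neural network is included.

```python
from typing import Dict, List, Optional, Tuple

# Placeholder for a masked entry (an assumption of this sketch).
MASK = None

# Attribute names taken from the abstract; the identifiers are illustrative.
ATTRIBUTES = [
    "speaker_identity", "pitch", "content",
    "loudness", "snr", "clarity_index",
]


def build_input(
    audio: List[float],
    attributes: Dict[str, float],
    task: str,
) -> Tuple[List[Optional[float]], Dict[str, Optional[float]]]:
    """Mask the part of the input that a masked model would have to predict."""
    if task == "analysis":
        # Keep the audio representation visible and hide every attribute.
        return list(audio), {name: MASK for name in attributes}
    if task == "generation":
        # Keep the attributes visible and hide the audio representation.
        return [MASK] * len(audio), dict(attributes)
    raise ValueError(f"unknown task: {task}")


def control(
    attributes: Dict[str, float], name: str, new_value: float
) -> Dict[str, float]:
    """Return a copy of the attributes with one value edited before generation."""
    if name not in attributes:
        raise KeyError(name)
    edited = dict(attributes)
    edited[name] = new_value
    return edited


# Usage: raise the pitch attribute, then prepare a generation input.
audio = [0.1, 0.2, 0.3]
attrs = {name: 1.0 for name in ATTRIBUTES}
gen_audio, gen_attrs = build_input(audio, control(attrs, "pitch", 2.0), "generation")
```

In this sketch, control is simply analysis followed by an attribute edit and then generation, which is one plausible reading of how modifying the estimated attributes steers the synthesized speech.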