PitchFlower: A flow-based neural audio codec with pitch controllability
This work addresses the need for controllable audio synthesis in speech processing and offers an extensible framework for disentangling speech attributes, though it appears incremental because it builds on existing flow-based and neural codec methods.
The authors address neural audio coding with explicit pitch control by introducing PitchFlower, a flow-based codec that combines F0 conditioning with a vector-quantization bottleneck. PitchFlower achieves more accurate pitch control than WORLD at higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality.
We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high-quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
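The training-time perturbation described above (flatten the F0 contour, then shift it randomly) might look roughly like the following sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `flatten_and_shift_f0`, the choice of the geometric mean as the flat pitch, the uniform shift in semitones, the `max_shift_semitones` range, and the convention that unvoiced frames are marked with 0 are all hypothetical. The abstract does not specify these details, nor how the perturbed contour is applied to the encoder input.

```python
import math
import random


def flatten_and_shift_f0(f0, rng, max_shift_semitones=4.0):
    """Flatten a per-frame F0 contour (Hz) and shift it by a random interval.

    Assumptions (not from the paper): unvoiced frames are encoded as 0,
    the flat pitch is the geometric mean of the voiced frames, and the
    shift is drawn uniformly from [-max_shift_semitones, +max_shift_semitones].
    Returns the perturbed contour and the shift that was applied.
    """
    voiced = [f for f in f0 if f > 0]
    if not voiced:
        # Nothing to perturb in a fully unvoiced segment.
        return [0.0] * len(f0), 0.0

    # Geometric mean: the average pitch on a logarithmic (musical) scale.
    base = math.exp(sum(math.log(f) for f in voiced) / len(voiced))

    # One random shift per utterance, converted from semitones to a ratio.
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    flat_value = base * 2.0 ** (shift / 12.0)

    # Voiced frames all get the same pitch; unvoiced frames stay at 0.
    flat = [flat_value if f > 0 else 0.0 for f in f0]
    return flat, shift


rng = random.Random(0)
true_f0 = [0.0, 110.0, 220.0, 0.0, 440.0]  # conditioning signal (unchanged)
flat_f0, shift = flatten_and_shift_f0(true_f0, rng)
```

In this reading, `flat_f0` stands in for the pitch content the encoder sees, while `true_f0` is passed separately as conditioning, so the decoder has to rely on the conditioning rather than the encoded input to reconstruct the original pitch.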