FCPE: A Fast Context-based Pitch Estimation Model
This work addresses noise robustness and computational efficiency in pitch estimation for applications such as MIDI transcription and singing voice conversion; it is an incremental improvement over existing methods.
The paper tackles pitch estimation in monophonic audio, where existing methods lose accuracy under noise. It proposes FCPE, a fast context-based model that achieves 96.79% Raw Pitch Accuracy on the MIR-1K dataset and a Real-Time Factor of 0.0062 on a single RTX 4090 GPU, matching state-of-the-art accuracy while significantly improving efficiency.
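The two headline metrics can be made concrete with a small sketch. It assumes the conventional definitions, which the text here does not spell out: Raw Pitch Accuracy as the fraction of reference-voiced frames whose estimate lies within 50 cents of the reference, and Real-Time Factor as processing time divided by audio duration. The function names and the sample values are hypothetical and are not taken from the FCPE code base.

```python
import math


def raw_pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of reference-voiced frames estimated within the tolerance.

    A frequency of 0 (or below) marks an unvoiced frame. The 50-cent
    tolerance is the conventional default, assumed here.
    """
    if len(ref_hz) != len(est_hz):
        raise ValueError("reference and estimate must have the same length")
    voiced = 0
    correct = 0
    for ref, est in zip(ref_hz, est_hz):
        if ref <= 0:
            continue  # frame is unvoiced in the reference, so it is not scored
        voiced += 1
        if est > 0 and abs(1200.0 * math.log2(est / ref)) <= tolerance_cents:
            correct += 1
    return correct / voiced if voiced else 0.0


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds


# Illustrative values only: one of three voiced frames is within 50 cents.
rpa = raw_pitch_accuracy([220.0, 220.0, 0.0, 440.0], [221.0, 233.0, 100.0, 0.0])
# An RTF near 0.0062 corresponds to roughly 0.062 s of compute per 10 s of audio.
rtf = real_time_factor(0.062, 10.0)
print(f"RPA = {rpa:.3f}, RTF = {rtf:.4f}")
```

Under these definitions, an RTF of 0.0062 means the model processes audio roughly 160 times faster than real time on the stated hardware.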
Pitch estimation (PE) in monophonic audio is crucial for MIDI transcription and singing voice conversion (SVC), but existing methods suffer significant performance degradation under noise. In this paper, we propose FCPE, a fast context-based pitch estimation model that employs a Lynx-Net architecture with depth-wise separable convolutions to capture mel spectrogram features effectively while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with state-of-the-art methods. The Real-Time Factor (RTF) is 0.0062 on a single RTX 4090 GPU, making FCPE significantly more efficient than existing algorithms. Code is available at https://github.com/CNChTu/FCPE.
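To illustrate why depth-wise separable convolutions keep computational cost low, the following sketch compares parameter counts for a standard 1D convolution and its depth-wise separable counterpart (one filter per input channel, followed by a 1x1 point-wise mix). The use of 1D convolutions and the channel count and kernel size below are assumptions chosen for illustration; they are not the actual Lynx-Net configuration.

```python
def standard_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Weights of a dense 1D convolution: every output channel sees every input channel."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Depth-wise stage (one kernel per input channel) plus point-wise 1x1 stage."""
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer size, not taken from the paper.
channels, kernel = 256, 31
dense = standard_conv1d_params(channels, channels, kernel)
separable = depthwise_separable_conv1d_params(channels, channels, kernel)
print(f"standard: {dense:,} params")
print(f"depth-wise separable: {separable:,} params")
print(f"reduction: {dense / separable:.1f}x")
```

For this hypothetical layer the separable form needs about 27 times fewer parameters, and the multiply-accumulate count per output frame shrinks by a similar factor, which is the general mechanism behind the efficiency claim.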