ZIPA: A family of efficient models for multilingual phone recognition
This work addresses efficient crosslinguistic phone recognition, but its contribution is incremental: it builds on existing methods such as Zipformer and noisy student training.
The authors tackle multilingual phone recognition by introducing ZIPA, a family of efficient models that achieve state-of-the-art performance with fewer parameters. The models are trained on a large-scale corpus of 17,132 hours and improve further when scaled with 11,000 hours of pseudo-labeled data.
We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. Trained on this data, the ZIPA models, which include transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage efficient Zipformer backbones and outperform existing phone recognition systems with far fewer parameters. Scaling via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
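
For readers unfamiliar with CTC-based recognizers, the sketch below shows how a per-frame label sequence is typically collapsed into a phone sequence by greedy CTC decoding: merge consecutive repeats, then drop blanks. This is a generic illustration and not ZIPA's actual decoding code; the function name `ctc_greedy_collapse`, the blank symbol `<blk>`, and the example frames are all hypothetical.

```python
from itertools import groupby

BLANK = "<blk>"  # hypothetical blank symbol, not taken from the ZIPA release


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame best labels into a phone sequence (greedy CTC)."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blank symbols, which only separate emissions.
    return [label for label in merged if label != blank]


# Illustrative per-frame argmax labels for a short utterance.
frames = ["<blk>", "k", "k", "<blk>", "æ", "æ", "æ", "<blk>", "t", "t"]
phones = ctc_greedy_collapse(frames)
print(" ".join(phones))
```

Because blanks are removed only after repeats are merged, a phone that genuinely occurs twice in a row survives as two phones whenever a blank frame separates the two runs.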