Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios
This work addresses the problem of understanding language similarity for speech-based NLP tasks, particularly in low-resource scenarios, where existing multilingual work covers only a relatively small subset of languages.
This paper proposes a deep learning method to analyze language similarity from acoustic examples, specifically by training a model on the Wilderness dataset and comparing its latent space with classical language family findings. This approach offers a new direction for cross-lingual data augmentation in speech-based NLP tasks.
Existing multilingual speech NLP work focuses on a relatively small subset of languages, and thus current linguistic understanding of languages predominantly stems from classical approaches. In this work, we propose a method to analyze language similarity using deep learning. Specifically, we train a model on the Wilderness dataset and investigate how its latent space compares with classical language family findings. Our approach provides a new direction for cross-lingual data augmentation in speech-based NLP tasks.
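One way to make the comparison between a learned latent space and classical language families concrete is sketched below. This is an illustrative assumption, not the procedure used in this work: it supposes that utterance-level embeddings have already been extracted from some trained model, averages them into one centroid per language, and reports how often a language's nearest neighbor (by cosine similarity) belongs to the same family. The function names, the `family_of` mapping, and the choice of nearest-neighbor agreement as the score are all hypothetical.

```python
import numpy as np


def language_centroids(embeddings, lang_ids):
    """Average utterance-level embeddings into one vector per language.

    embeddings: array of shape (n_utterances, dim), assumed to come from
                some trained acoustic model (hypothetical input).
    lang_ids:   sequence of n_utterances language codes.
    """
    ids = np.asarray(lang_ids)
    langs = sorted(set(lang_ids))
    cents = np.stack([embeddings[ids == lang].mean(axis=0) for lang in langs])
    return langs, cents


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def nearest_neighbor_family_agreement(langs, cents, family_of):
    """Fraction of languages whose closest centroid is in the same family.

    family_of: dict mapping language code -> family label taken from a
               classical classification (hypothetical input).
    """
    sim = cosine_similarity_matrix(cents)
    np.fill_diagonal(sim, -np.inf)  # a language is not its own neighbor
    nearest = sim.argmax(axis=1)
    hits = [family_of[langs[i]] == family_of[langs[j]]
            for i, j in enumerate(nearest)]
    return float(np.mean(hits))
```

A score near 1.0 would suggest that the latent space groups languages in a way that agrees with the classical families, while a score near the chance level would suggest little agreement. Other measures, such as clustering metrics computed on the same centroids, could be substituted.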