CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
This work addresses the challenge of emotion-aware multilingual speech processing, particularly for low-resource languages. It appears incremental, however, because it builds on existing contrastive learning and data augmentation techniques.
The paper tackles the problem of limited labeled data for multilingual speech emotion recognition. It introduces CLARA, a self-supervised learning method that uses a large multilingual audio corpus to develop emotion-enriched speech representations, and it reports excellent performance in zero-shot and few-shot learning scenarios.
Multilingual speech processing requires understanding emotions, a task made difficult by limited labeled data. CLARA minimizes reliance on labeled data and thereby improves generalization across languages. It fosters shared representations that aid cross-lingual transfer of speech and emotions, even with little data. Our approach captures emotional nuances in speech, overcoming subjective assessment issues. Using a large multilingual audio corpus and self-supervised learning, CLARA develops speech representations enriched with emotions, advancing emotion-aware multilingual speech processing. Our method expands the data range using data augmentation and textual embedding for visual understanding, and it transfers knowledge from high- to low-resource languages. CLARA demonstrates excellent performance in emotion recognition, language comprehension, and audio benchmarks, excelling in zero-shot and few-shot learning. It adapts to low-resource languages, marking progress in multilingual speech representation learning.
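The abstract does not spell out the training objective, so the following is only a minimal sketch of what a contrastive objective over paired audio and text embeddings could look like. It assumes a generic symmetric InfoNCE loss of the kind used in CLIP-style models. The function name `symmetric_info_nce`, the temperature value, and the toy embeddings are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's exact loss).

    audio_emb[i] and text_emb[i] are treated as a positive pair; every other
    combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)

    # Cosine similarity between every audio/text pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: row i should pick column i.
    loss_a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio direction: column j should pick row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_a2t + loss_t2a)


# Toy 2-D embeddings (hypothetical values for illustration only).
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_info_nce(audio, text_matched)
loss_swapped = symmetric_info_nce(audio, text_swapped)
```

In this toy setup the loss is close to zero when each audio embedding lines up with its own text embedding, and large when the pairs are swapped. That is the behavior a contrastive objective relies on to pull matching representations together in a shared space.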