CL · Jan 13, 2022

Speech Resources in the Tamasheq Language

arXiv:2201.05051v3587 citations
AI Analysis

This paper gives the speech community resources for working on Tamasheq. It is an incremental contribution that makes existing data available for benchmarking.

The paper introduces two datasets for Tamasheq, a low-resource language: 671 hours of unlabeled audio in five languages and a 17-hour parallel Tamasheq-French corpus, both aimed at supporting speech translation research.

In this paper, we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from daily broadcast news in Niger (Studio Kalangou) and Mali (Studio Tamani). We share (i) a massive amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller 17-hour parallel corpus of audio recordings in Tamasheq, with utterance-level translations in French. All of this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.

Code Implementations: 1 repo