CL SD AS · Sep 6, 2023

RoDia: A New Dataset for Romanian Dialect Identification from Speech

Codrut Rotaru, Nicolae-Catalin Ristea, Radu Tudor Ionescu

arXiv:2309.03378v39.932 citations · h-index: 37 · Has Code

Originality: Synthesis-oriented

AI Analysis

RoDia gives researchers working on Romanian dialect identification a new resource for a task that previously had no speech dataset.

The authors introduce RoDia, the first dataset for Romanian dialect identification from speech, containing 2 hours of manually annotated speech from five regions of Romania. They also establish baseline models, the best of which reaches a macro F1 of 59.83% and a micro F1 of 62.08%.

We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
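The abstract reports both macro and micro F1. As a minimal sketch of how the two averages differ for a single-label, multi-class task such as dialect identification, the Python below computes both from scratch. The function name `f1_scores`, the region labels `r1` to `r3`, and the toy predictions are hypothetical and are not taken from the RoDia code or data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # predicted class p, but it was wrong
            fn[t] += 1      # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro F1: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro F1: F1 over pooled counts, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: placeholder region labels, not RoDia annotations.
y_true = ["r1", "r1", "r1", "r1", "r2", "r3"]
y_pred = ["r1", "r1", "r1", "r1", "r1", "r3"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In this toy case the single `r2` sample is misclassified, which pulls macro F1 well below micro F1. For single-label classification, micro F1 equals accuracy, so a macro score below the micro score, as in the reported 59.83% versus 62.08%, is consistent with some classes being harder than others.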
