CL AI DL | Jan 11, 2025

A Survey on Spoken Italian Datasets and Corpora

arXiv:2501.06557v2 | 2.73 citations | h-index: 15 | IEEE Access

Originality: Synthesis-oriented

AI Analysis

The survey addresses the shortage of spoken Italian datasets available to researchers and developers in linguistics and speech technology. Its contribution is incremental, since it compiles existing resources.

This survey analyzed 66 spoken Italian datasets to address the scarcity of resources for Italian compared to major languages, providing a comprehensive inventory and recommendations to support research and technology development.

Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
