CL | Sep 20, 2025

MoRoVoc: A Large Dataset for Geographical Variation Identification of the Spoken Romanian Language

Andrei-Marius Avram, Ema-Ioana Bănescu, Anda-Teodora Robea, Dumitru-Clementin Cercel, Mihaela-Claudia Cercel

arXiv:2509.16781v16.72 citations | h-index: 13 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the need for robust speech models for Romanian dialect identification. The contribution is incremental, since it builds on existing adversarial training methods.

The paper tackles the problem of identifying geographical variation in spoken Romanian. It introduces MoRoVoc, a large dataset with over 93 hours of audio, and proposes a multi-target adversarial training framework whose adversarial coefficients are adjusted dynamically via meta-learning. In the reported results, Wav2Vec2-Base achieves 78.21% accuracy for variation identification and Wav2Vec2-Large reaches 93.08% accuracy for gender classification.

This paper introduces MoRoVoc, the largest dataset for analyzing the regional variation of spoken Romanian. It has more than 93 hours of audio and 88,192 audio samples, balanced between the Romanian language spoken in Romania and the Republic of Moldova. We further propose a multi-target adversarial training framework for speech models that incorporates demographic attributes (i.e., age and gender of the speakers) as adversarial targets, making models discriminative for primary tasks while remaining invariant to secondary attributes. The adversarial coefficients are dynamically adjusted via meta-learning to optimize performance. Our approach yields notable gains: Wav2Vec2-Base achieves 78.21% accuracy for the variation identification of spoken Romanian using gender as an adversarial target, while Wav2Vec2-Large reaches 93.08% accuracy for gender classification when employing both dialect and age as adversarial objectives.
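The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to combine a primary loss with several adversarial losses: the encoder minimises the primary loss while being penalised when the adversarial heads succeed. The function names, the example logits, and the coefficient values are all hypothetical, and the meta-learning step that adjusts the coefficients is not shown.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (logits: list of floats)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def adversarial_objective(primary_loss, adversarial_losses, coefficients):
    """Encoder-side objective (assumed form, not taken from the paper):
    primary loss minus the coefficient-weighted adversarial losses, so that
    lowering it keeps the primary task accurate while making the secondary
    attributes harder to predict."""
    if adversarial_losses.keys() != coefficients.keys():
        raise ValueError("each adversarial target needs exactly one coefficient")
    penalty = sum(coefficients[k] * adversarial_losses[k] for k in adversarial_losses)
    return primary_loss - penalty


# Hypothetical logits for one utterance.
primary = cross_entropy([2.0, 0.5], target=0)  # variation: Romania vs. Moldova
adversarial = {
    "gender": cross_entropy([0.1, 0.3], target=1),
    "age": cross_entropy([0.2, 0.2, 0.2], target=2),
}
# Fixed here for illustration; the paper adjusts these via meta-learning.
coefficients = {"gender": 0.5, "age": 0.1}

total = adversarial_objective(primary, adversarial, coefficients)
print(f"primary={primary:.4f} total={total:.4f}")
```

In practice this sign flip is often implemented with a gradient reversal layer, so that the adversarial heads still learn to predict gender and age while the shared encoder is pushed the other way. Whether MoRoVoc's framework uses that mechanism is not stated in the abstract.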
