CL CV LG SD AS · Mar 1, 2024

Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART

Aniket Tathe, Anand Kamble, Suyash Kumbharkar, Atharva Bhandare, Anirban C. Mitra

arXiv:2403.00212v1 · 3 citations · h-index: 3

Originality: Synthesis-oriented

AI Analysis

This work provides an accessible solution for transcribing and translating multilingual video for a personalized voice, but it is incremental: it combines existing methods, namely XLSR Wav2Vec2 fine-tuned on a custom dataset and mBART.

The research tackled the challenge of training an automatic speech recognition model for personalized voices with minimal data, achieving transcription and translation of Hindi videos using only 14 minutes of custom audio and delivering a web-based GUI for accessible multilingual content.

This research addresses the challenge of training an ASR model for personalized voices with minimal data. Utilizing just 14 minutes of custom audio from a YouTube video, we employ Retrieval-Based Voice Conversion (RVC) to create a custom Common Voice 16.0 corpus. Subsequently, a Cross-lingual Self-supervised Representations (XLSR) Wav2Vec2 model is fine-tuned on this dataset. The developed web-based GUI efficiently transcribes and translates input Hindi videos. By integrating XLSR Wav2Vec2 and mBART, the system aligns the translated text with the video timeline, delivering an accessible solution for multilingual video content transcription and translation for personalized voice.
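
The abstract does not say how the translated text is attached to the video timeline. As a minimal sketch of that step, assuming the ASR stage yields timestamped segments and that the output is an SRT subtitle file (both are assumptions, not details from the paper), the alignment could look like the following. `Segment`, `format_timestamp`, `to_srt`, and the `translate` callable are hypothetical names; `translate` merely stands in for an mBART translation call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One transcribed span of audio (hypothetical structure)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # source-language transcript for this span


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: List[Segment], translate: Callable[[str], str]) -> str:
    """Translate each segment and keep its original start/end times."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{translate(seg.text)}\n"
        )
    # Blank line between consecutive subtitle blocks, as SRT expects.
    return "\n".join(blocks)
```

Because each translated string inherits the timing of the segment it came from, no separate alignment model is needed in this sketch; a real system may need to re-split long translations to fit on screen.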
