SD LG AS Jul 18, 2023

OxfordVGG Submission to the EGO4D AV Transcription Challenge

arXiv:2307.09006v1

Originality: Synthesis-oriented

AI Analysis

This work addresses transcription accuracy in audio-visual data for applications like video analysis, but it is incremental as it builds on existing methods.

The paper tackled the problem of automatic speech recognition for long-form audio with word-level time alignment in the EGO4D AV Transcription Challenge, achieving a Word Error Rate of 56.0% and ranking first on the leaderboard.

This report presents the technical details of the OxfordVGG team's submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two publicly available text normalisers. Our final submission achieved a Word Error Rate (WER) of 56.0% on the challenge test set, ranking 1st on the leaderboard. All baseline code and models are available at https://github.com/m-bain/whisperX.
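
For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The `normalise` function here is a hypothetical stand-in (lowercasing and punctuation stripping only) and is not one of the normalisers released with the submission; the function names are likewise assumptions made for illustration.

```python
import re


def normalise(text: str) -> list[str]:
    """Illustrative normaliser (an assumption, not the paper's):
    lowercase, replace punctuation other than apostrophes with spaces,
    then split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because the score depends on how text is normalised before comparison (casing, punctuation, number formats), the choice of normaliser can change the reported WER noticeably, which is presumably why the submission ships its own.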
