CL · Jul 5, 2022

ASR-Generated Text for Language Model Pre-training Applied to Speech Tasks

arXiv:2207.01893v1 · 8 citations · h-index: 42
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of building effective spoken language models for applications such as spoken language understanding and speech syntactic parsing. The contribution is incremental: it applies existing methods to new data.

The researchers set out to improve spoken language modeling using a large amount of automatically transcribed speech, drawn from 350,000 hours of TV shows. The resulting models showed benefits over the original FlauBERT on downstream tasks such as spoken language understanding and speech syntactic parsing.

We aim to improve spoken language modeling (LM) using a very large amount of automatically transcribed speech. We leverage the INA (French National Audiovisual Institute) collection and obtain 19GB of text after applying ASR to 350,000 hours of diverse TV shows. From this, spoken language models are trained either by fine-tuning an existing LM (FlauBERT) or by training an LM from scratch. The new models (FlauBERT-Oral) are shared with the community and evaluated on three downstream tasks: spoken language understanding, classification of TV shows, and speech syntactic parsing. Results show that FlauBERT-Oral can be beneficial compared to its initial FlauBERT version, demonstrating that, despite its inherently noisy nature, ASR-generated text can be used to build spoken language models.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
