IR LG Sep 25, 2021

Topic Model Robustness to Automatic Speech Recognition Errors in Podcast Transcripts

Raluca Alexandra Fetic, Mikkel Jordahn, Lucas Chaves Lima, Rasmus Arpe Fogh Egebæk, Martin Carsten Nielsen, Benjamin Biering, Lars Kai Hansen

arXiv:2109.12306v12.0

Originality Synthesis-oriented

AI Analysis

This work addresses content recommendation challenges for a multilingual podcast streaming service, but it is incremental: it applies an existing method to a new low-resource language context.

The study investigated the robustness of Latent Dirichlet Allocation topic models to errors in automatic speech recognition transcripts for Danish podcasts, finding that high-quality topic embeddings can still be obtained even with increasing transcription noise.

For a multilingual podcast streaming service, it is critical to be able to deliver relevant content to all users regardless of language. Podcast content relevance is conventionally determined using various metadata sources. However, with the increasing quality of speech recognition in many languages, it becomes possible to use automatic transcriptions to provide better content recommendations. In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine. Specifically, we explore how increasing transcription noise influences the topics obtained from transcriptions in Danish, a low-resource language. First, we establish a baseline of cosine similarity scores between topic embeddings from automatic transcriptions and the descriptions of the podcasts written by the podcast creators. We then observe how the cosine similarities decrease as transcription noise increases and conclude that even when automatic speech recognition transcripts are erroneous, it is still possible to obtain high-quality topic embeddings from the transcriptions.
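
The two measurement steps described in the abstract, injecting transcription noise and comparing topic embeddings by cosine similarity, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the noise model (random deletion or substitution of tokens), the function names, and the example topic vectors are all assumptions made here, and in practice the topic vectors would come from a trained Latent Dirichlet Allocation model rather than being written by hand.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def add_transcription_noise(tokens, error_rate, vocabulary, rng):
    """Corrupt roughly `error_rate` of the tokens.

    Assumed noise model (not taken from the paper): each selected token
    is either deleted or replaced by a random vocabulary word, with
    equal probability. A replacement may happen to equal the original.
    """
    noisy = []
    for token in tokens:
        if rng.random() < error_rate:
            if rng.random() < 0.5:
                continue  # simulate a deletion error
            noisy.append(rng.choice(vocabulary))  # simulate a substitution
        else:
            noisy.append(token)
    return noisy


# Hypothetical topic vectors standing in for LDA output on a transcript
# and on the creator-written description of the same podcast.
transcript_topics = [0.70, 0.20, 0.10]
description_topics = [0.60, 0.30, 0.10]
similarity = cosine_similarity(transcript_topics, description_topics)
```

Under these assumptions, a robustness experiment would repeat the comparison at several values of `error_rate`, re-inferring the transcript's topic vector from the noisy tokens each time and tracking how `similarity` changes.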
