CL SD AS
May 15, 2020

Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model

arXiv:2005.07394v1
22 citations
Originality: Incremental advance
AI Analysis

This paper targets speech recognition accuracy for social media videos by leveraging contextual metadata. It is an incremental improvement over existing methods.

The paper tackles improving automatic speech recognition for videos by using video metadata descriptions during lattice rescoring, achieving performance improvements through attention-based contextual vectors and a hybrid pointer network approach.

Videos uploaded on social media are often accompanied by textual descriptions. In building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore ASR lattice rescoring by selectively attending to the video descriptions. First, we use an attention-based method to extract contextual vector representations of video metadata, and use these representations as part of the inputs to a neural language model during lattice rescoring. Second, we propose a hybrid pointer network approach to explicitly interpolate the word probabilities of the word occurrences in metadata. We perform experimental evaluations on both language modeling and ASR tasks, and demonstrate that both proposed methods provide performance improvements by selectively leveraging the video metadata.
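
The interpolation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation, which this summary does not give. It is a generic pointer-mixture step that assumes a scalar gate, softmax attention over metadata tokens, and a small fixed vocabulary. The function names (`softmax`, `hybrid_pointer_probs`), the `gate` parameter, and all numbers below are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_pointer_probs(vocab_logits, vocab, attn_scores, metadata_tokens, gate):
    """Mix a vocabulary distribution with a pointer distribution over metadata.

    Assumed (hypothetical) form of the mixture:
        p(w) = gate * p_vocab(w) + (1 - gate) * sum of attention on metadata
               positions whose token equals w
    `gate` is a scalar in [0, 1]; in a trained model it would be predicted
    by the network rather than passed in by hand.
    """
    p_vocab = softmax(vocab_logits)   # language model distribution
    attn = softmax(attn_scores)       # attention over metadata positions

    # Start from the gated vocabulary probabilities.
    mixed = {w: gate * p for w, p in zip(vocab, p_vocab)}

    # Add the pointer mass; repeated metadata tokens accumulate.
    for tok, a in zip(metadata_tokens, attn):
        mixed[tok] = mixed.get(tok, 0.0) + (1.0 - gate) * a
    return mixed


# Toy example: "pasta" appears in the video description, so it receives
# extra probability mass beyond what the vocabulary model assigns.
vocab = ["the", "recipe", "pasta"]
probs = hybrid_pointer_probs(
    vocab_logits=[1.0, 0.5, 0.2],
    vocab=vocab,
    attn_scores=[0.1, 2.0],
    metadata_tokens=["easy", "pasta"],
    gate=0.7,
)
```

Because both component distributions sum to one and the gate forms a convex combination, the mixed probabilities also sum to one. A word that occurs only in the metadata ("easy" here) still receives nonzero probability, which is the practical appeal of a pointer component for rare or out-of-vocabulary words.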

