SD, CL, AS
Jul 2, 2022

Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition

arXiv:2207.00857v1
19 citations
h-index: 64
Originality: Incremental advance
AI Analysis

This work addresses the need for more accurate automatic speech recognition (ASR) in applications that depend on contextual knowledge, such as meeting transcription with cues drawn from slides. It is an incremental improvement over existing techniques.

The paper improves contextual speech recognition by incorporating biasing words. It reports a further relative word error rate (WER) reduction of about 15% on the biasing words compared to the original TCPGen, with a negligible increase in decoding cost.
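To make the "relative" in that figure concrete, here is a minimal sketch of how a relative error-rate reduction is usually computed. The function name and the two error rates are hypothetical illustrations, not numbers from the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error rate that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical error rates on biasing words (percent), chosen for illustration only.
example = relative_reduction(baseline_wer=10.0, new_wer=8.5)
print(f"relative reduction: {example:.0%}")
```

A relative reduction of about 15% therefore means the new error rate is roughly 0.85 times the baseline, whatever the absolute values are.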

Incorporating biasing words obtained as contextual knowledge is critical for many automatic speech recognition (ASR) applications. This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR. By encoding the biasing words in the prefix-tree with a tree-based GNN, lookahead for future wordpieces in end-to-end ASR decoding is achieved at each tree node by incorporating information about all wordpieces on the tree branches rooted from it, which allows a more accurate prediction of the generation probability of the biasing words. Systems were evaluated on the Librispeech corpus using simulated biasing tasks, and on the AMI corpus by proposing a novel visual-grounded contextual ASR pipeline that extracts biasing words from slides alongside each meeting. Results showed that TCPGen with GNN encodings achieved about a further 15% relative WER reduction on the biasing words compared to the original TCPGen, with a negligible increase in the computation cost for decoding.
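The lookahead idea in the abstract can be illustrated with a toy sketch. It builds a prefix tree over wordpiece sequences and computes a bottom-up encoding at each node that summarizes every wordpiece in the subtree rooted there. This is not the paper's model: the parameter-free sum below stands in for a learned tree-based GNN, and the wordpieces, the one-hot embedding, and all names (`Node`, `build_prefix_tree`, `encode_subtree`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    piece: str = ""                      # wordpiece on the edge into this node
    children: dict = field(default_factory=dict)
    is_word_end: bool = False            # True if a biasing word ends here
    encoding: list = field(default_factory=list)


def build_prefix_tree(words):
    """Build a prefix tree from biasing words given as wordpiece lists."""
    root = Node()
    for pieces in words:
        node = root
        for piece in pieces:
            if piece not in node.children:
                node.children[piece] = Node(piece=piece)
            node = node.children[piece]
        node.is_word_end = True
    return root


def encode_subtree(node, embed, dim):
    """Bottom-up pass: own embedding plus the sum of the child encodings.

    A learned tree GNN would apply weights and nonlinearities here; the
    plain sum is a simplified stand-in that keeps the example checkable.
    """
    own = embed(node.piece) if node.piece else [0.0] * dim
    total = list(own)
    for child in node.children.values():
        child_enc = encode_subtree(child, embed, dim)
        total = [t + c for t, c in zip(total, child_enc)]
    node.encoding = total
    return total


# Hypothetical wordpiece vocabulary and biasing list.
VOCAB = ["tur", "ner", "bo", "vi", "ola"]


def one_hot(piece):
    return [1.0 if piece == v else 0.0 for v in VOCAB]


root = build_prefix_tree([["tur", "ner"], ["tur", "bo"], ["vi", "ola"]])
encode_subtree(root, one_hot, len(VOCAB))

tur = root.children["tur"]
print(sorted(tur.children))   # wordpieces the tree allows after "tur"
print(tur.encoding)           # summary of all wordpieces under "tur"
```

With one-hot embeddings, the encoding at a node simply counts the wordpieces in its subtree, so the node for "tur" already carries information about "ner" and "bo" before either is decoded. In the paper, this kind of subtree summary is what lets the model predict the generation probability of biasing words more accurately.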

