Lattention: Lattice-attention in ASR rescoring
This work examines whether lattice cues, which are known to help downstream tasks such as spoken language understanding, also improve second-pass n-best rescoring in automatic speech recognition.
The paper uses lattice representations to rescore n-best lists, achieving a 4-5% relative word error rate reduction over the first pass with attention to lattices and 6-8% with attention to both lattices and acoustic features.
Lattices form a compact representation of multiple hypotheses generated by an automatic speech recognition system and have been shown to improve the performance of downstream tasks like spoken language understanding and speech translation, compared to using the one-best hypothesis. In this work, we look into the effectiveness of lattice cues for rescoring n-best lists in the second pass. We encode lattices with a recurrent network and train an attention encoder-decoder model for n-best rescoring. The rescoring model with attention to lattices achieves a 4-5% relative word error rate reduction over the first pass, and 6-8% with attention to both lattices and acoustic features. We show that rescoring models with attention to lattices outperform models with attention to n-best hypotheses. We also study different ways to incorporate lattice weights in the lattice encoder and demonstrate their importance for n-best rescoring.
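To make the second-pass setup concrete, the following is a minimal sketch of n-best rescoring and of how a relative word error rate reduction is computed. It is an illustration under stated assumptions, not the paper's implementation: the names `Hypothesis`, `rescore_nbest`, and `relative_wer_reduction` are hypothetical, the linear interpolation of first-pass and second-pass log-scores with a `weight` parameter is an assumed combination rule that the abstract does not specify, and `second_pass_scorer` is a stand-in for the attention encoder-decoder model, which would score each hypothesis while attending to the encoded lattice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """One entry of an n-best list (hypothetical container for illustration)."""
    text: str
    first_pass_score: float  # log-score assigned by the first-pass recognizer


def rescore_nbest(
    nbest: List[Hypothesis],
    second_pass_scorer: Callable[[str], float],
    weight: float = 0.5,
) -> Hypothesis:
    """Pick the hypothesis with the highest interpolated log-score.

    Assumption: scores are combined by linear interpolation. The scorer is
    a placeholder for a rescoring model's log-likelihood of the hypothesis.
    """
    if not nbest:
        raise ValueError("n-best list is empty")

    def combined(hyp: Hypothesis) -> float:
        return (1.0 - weight) * hyp.first_pass_score + weight * second_pass_scorer(hyp.text)

    return max(nbest, key=combined)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

As a purely illustrative calculation with numbers that are not taken from the paper, a first-pass WER of 10.0% that drops to 9.5% after rescoring corresponds to `relative_wer_reduction(10.0, 9.5)`, a 5% relative reduction, even though the absolute change is only 0.5 points.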