CL · LG · AS · Nov 20, 2019

On using 2D sequence-to-sequence models for speech recognition

arXiv:1911.08888v1 · 8 citations
Originality: Incremental advance
AI Analysis

This work addresses automatic speech recognition and offers an incremental improvement in model design.

The paper proposes a 2DLSTM architecture that models the relation between input and output sequences without attention, and it achieves competitive word error rates on the Switchboard 300h task.
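For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not the paper's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.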

Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. In these architectures, one-dimensional input and output sequences are related by an attention mechanism, which replaces more explicit alignment processes such as those in classical HMM-based modeling. In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture to directly model the relation between the input sequence of audio feature vectors and the output word sequence. The proposed model is an alternative: instead of using any type of attention component, we apply a 2DLSTM layer to assimilate the context from both input observations and output transcriptions. The experimental evaluation on the Switchboard 300h automatic speech recognition task shows word error rates for the 2DLSTM model that are competitive with those of an end-to-end attention-based model.
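To make the idea of a recurrence over both the input and output axes concrete, here is a minimal NumPy sketch of a generic multi-dimensional LSTM pass over a grid of T input frames by N output positions. It is an illustration under stated assumptions, not the paper's formulation: the gate layout (one forget gate per axis), the names `init_params` and `lstm2d_forward`, the toy dimensions, and the choice to feed each cell an encoder frame concatenated with the previous target-word embedding are all assumed here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, rng):
    # One weight matrix yields all five gate pre-activations at once.
    fan_in = input_dim + 2 * hidden_dim
    W = rng.normal(0.0, 0.1, size=(fan_in, 5 * hidden_dim))
    b = np.zeros(5 * hidden_dim)
    return W, b


def lstm2d_forward(enc, emb, W, b):
    """enc: (T, d_enc) input features; emb: (N, d_emb) embeddings of the
    previous target words. Returns hidden states of shape (T, N, H)."""
    T, N = enc.shape[0], emb.shape[0]
    H = b.shape[0] // 5
    # Extra zero row and column serve as the initial states at the borders.
    s = np.zeros((T + 1, N + 1, H))
    c = np.zeros((T + 1, N + 1, H))
    for t in range(1, T + 1):
        for n in range(1, N + 1):
            # Each cell sees its own inputs plus the states of its
            # predecessors along the time axis and the output axis.
            x = np.concatenate([enc[t - 1], emb[n - 1], s[t - 1, n], s[t, n - 1]])
            i, f_t, f_n, o, g = np.split(x @ W + b, 5)
            c[t, n] = (sigmoid(f_t) * c[t - 1, n]
                       + sigmoid(f_n) * c[t, n - 1]
                       + sigmoid(i) * np.tanh(g))
            s[t, n] = sigmoid(o) * np.tanh(c[t, n])
    return s[1:, 1:]


rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))   # 6 frames, 4-dim features (toy sizes)
emb = rng.normal(size=(3, 2))   # 3 output positions, 2-dim embeddings
W, b = init_params(4 + 2, 8, rng)
states = lstm2d_forward(enc, emb, W, b)
print(states.shape)  # (6, 3, 8)
```

In this sketch the state at grid position (t, n) depends only on frames up to t and output positions up to n, so no separate attention component is needed to relate the two sequences. How the grid states are read out to predict words is left open here, since the abstract does not specify it.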
