AS CL LG SD

Oct 31, 2018

On The Inductive Bias of Words in Acoustics-to-Word Models

arXiv:1810.13407v22.31 citations

Originality: Incremental advance

AI Analysis

This work addresses the difficulty of training end-to-end speech recognizers, but the advance is incremental: it builds on existing models and adds specific inductive biases.

The study investigates how the amount of training data and different inductive biases affect acoustics-to-word models. It finds that constraining word durations yields the most improvement, and that the learned word embeddings are dominated by pronunciation rather than context.

Acoustics-to-word models are end-to-end speech recognizers that use words as targets without relying on pronunciation dictionaries or graphemes. These models are notoriously difficult to train due to the lack of linguistic knowledge. It is also unclear how the amount of training data impacts the optimization and generalization of such models. In this work, we study the optimization and generalization of acoustics-to-word models under different amounts of training data. In addition, we study three types of inductive bias, leveraging a pronunciation dictionary, word boundary annotations, and constraints on word durations. We find that constraining word durations leads to the most improvement. Finally, we analyze the word embedding space learned by the model, and find that the space has a structure dominated by the pronunciation of words. This suggests that the contexts of words, instead of their phonetic structure, should be the future focus of inductive bias in acoustics-to-word models.
