Subword and Crossword Units for CTC Acoustic Models
The work addresses the trade-off between unit set size and available training data in speech recognition, but the contribution is incremental, since it builds on existing CTC and language model methods.
The paper tackles the problem of selecting a unit set for CTC-based speech recognition by using Byte Pair Encoding to learn a unit set of arbitrary size, and reports state-of-the-art results for grapheme-based CTC systems.
This paper proposes a novel approach for creating a unit set for CTC-based speech recognition systems. Using Byte Pair Encoding, we learn a unit set of arbitrary size from a given training text. In contrast to using characters or words as units, this allows us to find a good trade-off between the size of the unit set and the available training data. We evaluate both Crossword units, which may span multiple words, and Subword units. By combining this approach with decoding methods that use a separate language model, we achieve state-of-the-art results for grapheme-based CTC systems.
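To make the unit-learning step concrete, the following is a minimal sketch of generic Byte Pair Encoding in Python. It is not the authors' implementation. The function names (`learn_bpe`, `merge_pair`, `subword_sequences`, `crossword_sequences`), the `_` word-boundary symbol, and the tie-breaking rule are all assumptions made for illustration. The only difference between the two variants in this sketch is the input: Subword units are learned on individual words, so no merge can cross a word boundary, while Crossword units are learned on whole lines with the space kept as an explicit symbol.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` by one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Greedily merge the most frequent adjacent symbol pair, `num_merges` times.

    Returns the list of merges (in order) and the re-segmented sequences.
    The final unit set is the initial symbols plus one new unit per merge,
    so `num_merges` controls the unit set size.
    """
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # nothing left to merge
        # Assumed tie-break: highest count first, then lexicographic order.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges, seqs


def subword_sequences(text):
    """One sequence per word: merges can never span a word boundary."""
    return [list(word) for word in text.split()]


def crossword_sequences(text, space="_"):
    """One sequence per line, with spaces as explicit symbols: merges may span words."""
    return [list(" ".join(line.split()).replace(" ", space))
            for line in text.splitlines() if line.strip()]
```

Calling `learn_bpe(subword_sequences(text), n)` and `learn_bpe(crossword_sequences(text), n)` on the same training text gives two unit inventories of comparable size, which is one way to set up the Subword versus Crossword comparison described above.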