CL SD · May 14

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

arXiv:2605.1442755.1

Predicted impact top 98% in CL · last 90 days · Originality: Incremental advance

AI Analysis

For researchers training end-to-end ASR systems, this provides a formal method to choose vocabulary size, an important hyper-parameter, though it builds on prior work and is incremental.

This paper proposes a calculus-based framework to determine the optimal vocabulary size for end-to-end ASR systems, using first and second derivative tests on curve-fitted training data. Applied to LibriSpeech, the method improves ASR performance by selecting a better vocabulary size hyper-parameter.

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) take the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPnet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the first and second derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard LibriSpeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves ASR performance. The main contribution of this paper is a formal approach to identifying the vocabulary size best suited for training an end-to-end ASR system.
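The derivative-test idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which curve family or cost function the authors use, so everything below is an assumption for illustration: a polynomial is fitted to (vocabulary size, cost) pairs, stationary points are found where the first derivative is zero, and only those with a positive second derivative (local minima) are kept. The function name `estimate_vocab_size`, the polynomial degree, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def estimate_vocab_size(vocab_sizes, costs, degree=3):
    """Fit a polynomial to cost vs. vocabulary size and return the
    vocabulary size at a local minimum, or None if no minimum is found.

    Hypothetical helper: the polynomial model is an assumption, not
    necessarily the curve family used in the paper.
    """
    x = np.asarray(vocab_sizes, dtype=float)
    y = np.asarray(costs, dtype=float)

    # Rescale sizes to [0, 1] so the polynomial fit is well conditioned.
    x_min, x_max = x.min(), x.max()
    t = (x - x_min) / (x_max - x_min)

    poly = np.poly1d(np.polyfit(t, y, degree))
    d1 = poly.deriv(1)  # first derivative of the fitted curve
    d2 = poly.deriv(2)  # second derivative of the fitted curve

    # First derivative test: real stationary points inside the sampled range.
    # Second derivative test: keep only those where the curve is convex.
    minima = [
        r.real
        for r in d1.roots
        if abs(r.imag) < 1e-9 and 0.0 <= r.real <= 1.0 and d2(r.real) > 0
    ]
    if not minima:
        return None

    # If several local minima exist, take the one with the lowest fitted cost.
    best_t = min(minima, key=poly)
    return x_min + best_t * (x_max - x_min)


# Synthetic (made-up) data: a convex cost curve over candidate vocabulary sizes.
sizes = np.array([50, 100, 200, 500, 1000, 2000, 5000], dtype=float)
t_demo = (sizes - sizes.min()) / (sizes.max() - sizes.min())
demo_costs = (t_demo - 0.3) ** 2 + 1.0

best = estimate_vocab_size(sizes, demo_costs, degree=2)
print(f"Estimated vocabulary size: {best:.0f}")
```

In practice the cost values would come from the cost function framework of [1] evaluated at each candidate vocabulary size, and the estimate would be rounded to an integer before being passed to the tokenizer.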
