CL LG · Feb 16, 2022

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Hao Zhang, You-Chi Cheng, Shankar Kumar, W. Ronny Huang, Mingqing Chen, Rajiv Mathews

arXiv:2202.08171v1 · 0.87 citations

Originality: Incremental advance

AI Analysis

This work targets language modeling on noisy, user-generated text, as found in virtual keyboards and speech recognition, and reports incremental gains in efficiency and accuracy.

The paper tackles capitalization normalization (truecasing) for noisy text with a fast, accurate hierarchical RNN model. A case-aware language model trained on the normalized text matches the perplexity of one trained on gold-standard text. The gains carry over to real applications: lower prediction error rates in a virtual keyboard, and lower uppercase character error rate and word error rate in ASR.

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.
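To make the truecasing task concrete, here is a minimal sketch of a most-frequent-form baseline in Python. It is not the hierarchical word-and-character RNN proposed in the paper; it is only a simple lookup-table illustration of what "restoring the correct case" means. The function names `train_truecaser` and `truecase` and the toy corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count the cased surface forms seen for each lowercased word.

    Illustrative baseline only; the paper's model is a neural network.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first token: its capitalization is positional, not lexical.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep the most frequent cased form per word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(text, table):
    """Restore case word by word; unseen words are left lowercase."""
    tokens = [table.get(tok.lower(), tok.lower()) for tok in text.split()]
    if tokens and tokens[0].islower():
        # Capitalize the sentence-initial token if it is still all lowercase.
        tokens[0] = tokens[0][0].upper() + tokens[0][1:]
    return " ".join(tokens)


# Hypothetical toy corpus with gold capitalization.
corpus = [
    "We flew to New York with an iPhone",
    "She visited New York in May",
]
table = train_truecaser(corpus)
print(truecase("i took my iphone to new york", table))
```

A lookup table like this cannot handle unseen words or context-dependent casing (for example "may" versus "May"), which is the kind of gap a character-aware, context-sensitive model is meant to close.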
