CL

Aug 26, 2021

Position-Invariant Truecasing with a Word-and-Character Hierarchical Recurrent Neural Network

Hao Zhang, You-Chi Cheng, Shankar Kumar, Mingqing Chen, Rajiv Mathews

arXiv:2108.11943v2

0.51 citations

Originality: Incremental advance

AI Analysis

The work targets truecasing of noisy text produced by speech recognition or machine translation, which in turn improves downstream NLP tasks such as named entity recognition and language modeling. The advance is incremental because it builds on existing RNN methods.

The paper tackles truecasing, the task of restoring correct case in noisy text. It proposes a fast, accurate, and compact two-level hierarchical word-and-character RNN model, described as the first of its kind for this task, and uses sequence distillation to make truecasing position-invariant.

Truecasing is the task of restoring the correct case (uppercase or lowercase) of noisy text generated either by an automatic system for speech recognition or machine translation or by humans. It improves the performance of downstream NLP tasks such as named entity recognition and language modeling. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model, the first of its kind for this problem. Using sequence distillation, we also address the problem of truecasing while ignoring token positions in the sentence, i.e. in a position-invariant manner.
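To make the task definition concrete, the sketch below frames truecasing as per-character labeling: each character of the lowercased input gets an uppercase ("U") or lowercase ("L") label, and applying the labels restores the cased text. The helper names `case_labels` and `apply_case_labels` are hypothetical, and the per-character label scheme is an assumption for illustration; the abstract does not spell out the model's output format. In practice a trained model would predict the labels, whereas here they are read off a reference string.

```python
def case_labels(text: str) -> list[str]:
    """Return one label per character: 'U' if uppercase, else 'L'.

    Assumed labeling scheme for illustration only.
    """
    return ["U" if ch.isupper() else "L" for ch in text]


def apply_case_labels(lowercased: str, labels: list[str]) -> str:
    """Restore case by applying per-character labels to lowercased text."""
    if len(lowercased) != len(labels):
        raise ValueError("text and labels must have the same length")
    return "".join(
        ch.upper() if label == "U" else ch.lower()
        for ch, label in zip(lowercased, labels)
    )


# A reference sentence stands in for what a model would have to predict.
reference = "iPhone sales rose in New York"
noisy = reference.lower()          # simulated caseless system output
labels = case_labels(reference)    # a trained model would predict these
restored = apply_case_labels(noisy, labels)
assert restored == reference
```

Mixed-case words such as "iPhone" show why a label per character, rather than a single label per word, can be useful.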
