CL · Jan 30, 2020

An Efficient Architecture for Predicting the Case of Characters using Sequence Models

Gopi Ramena, Divija Nagaraju, Sukumar Moharana, Debi Prasanna Mohanty, Naresh Purre

arXiv:2002.00738v1 · 8 citations

AI Analysis

This incremental improvement in truecasing enhances preprocessing for NLP applications, benefiting tasks reliant on clean textual data.

The paper tackles the problem of restoring correct character case (truecasing) in noisy text from sources like social media, using a CNN-BiLSTM-CRF architecture at the character level without feature engineering, achieving a 0.83 F1 score improvement over the state of the art.

The dearth of clean textual data is a bottleneck in several natural language processing applications. The available data often lacks proper case (uppercase or lowercase) information, particularly when text is obtained from social media, messaging applications and other online platforms. This paper attempts to solve this problem by restoring the correct case of characters, commonly known as truecasing. Doing so improves the accuracy of several processing tasks further down the NLP pipeline. Our proposed architecture uses a combination of convolutional neural networks (CNN), bi-directional long short-term memory networks (BiLSTM) and conditional random fields (CRF), which work at a character level without any explicit feature engineering. In this study we compare our approach to previous statistical and deep learning based approaches. Our method shows an improvement of 0.83 in F1 score over the current state of the art. Since truecasing acts as a preprocessing step in several applications, every increment in the F1 score leads to a significant improvement in the language processing tasks.
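To make the character-level framing concrete, here is a minimal sketch of truecasing as per-character sequence labeling. It covers only the data representation a tagger would consume and produce, not the CNN-BiLSTM-CRF model itself. The function names `to_labels` and `apply_labels` and the `"U"`/`"L"` tag set are assumptions for illustration; the abstract does not specify the paper's label scheme.

```python
def to_labels(text):
    """Split cased text into (lowercased characters, case labels).

    Labels are hypothetical: "U" marks an uppercase character, "L" anything else.
    Assumes each character lowercases to exactly one character (true for ASCII).
    """
    lowered = "".join(ch.lower() for ch in text)
    labels = ["U" if ch.isupper() else "L" for ch in text]
    return lowered, labels


def apply_labels(lowered, labels):
    """Restore case by uppercasing every character tagged "U"."""
    return "".join(
        ch.upper() if label == "U" else ch
        for ch, label in zip(lowered, labels)
    )


# Round trip: a sequence model would predict `labels` from `lowered`.
lowered, labels = to_labels("Hello from New York")
restored = apply_labels(lowered, labels)
```

In this framing, a trained model would take the place of `to_labels` at inference time, predicting one label per input character, and `apply_labels` would turn those predictions back into cased text.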
