CL, May 14, 2020

NAT: Noise-Aware Training for Robust Neural Sequence Labeling

arXiv:2005.07162v1, 1002 citations
AI Analysis

The paper addresses the need for reliable sequence labeling in real-world applications that process noisy user-generated text or the output of error-prone upstream components. It represents an incremental improvement in robustness methods.

The paper tackles the problem of making neural sequence labeling models robust to corrupted inputs, such as OCR errors and misspellings. It proposes two Noise-Aware Training (NAT) objectives, data augmentation and stability training, which improved robustness on English and German named entity recognition benchmarks while preserving accuracy on clean data.

Sequence labeling systems should perform reliably not only under ideal conditions but also with corrupted inputs, as these systems often process user-generated text or follow an error-prone upstream component. To this end, we formulate the noisy sequence labeling problem, where the input may undergo an unknown noising process, and propose two Noise-Aware Training (NAT) objectives that improve robustness of sequence labeling performed on perturbed input: our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation. We employ a vanilla noise model at training time. For evaluation, we use both the original data and its variants perturbed with real OCR errors and misspellings. Extensive experiments on English and German named entity recognition benchmarks confirmed that NAT consistently improved robustness of popular sequence labeling models while preserving accuracy on the original input. We make our code and data publicly available for the research community.
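
The two objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The character-level edit operations, the `noise_rate` parameter, the KL-divergence penalty, the weight `alpha`, and all function names are assumptions chosen for illustration, since the abstract specifies neither the noise model nor the distance used by stability training.

```python
import math
import random
import string


def perturb_token(token, noise_rate, rng, alphabet=string.ascii_lowercase):
    """Randomly insert, delete, or substitute characters (hypothetical noise model)."""
    out = []
    for ch in token:
        if rng.random() < noise_rate:
            op = rng.choice(("insert", "delete", "substitute"))
            if op == "insert":
                out.append(ch)
                out.append(rng.choice(alphabet))
            elif op == "substitute":
                out.append(rng.choice(alphabet))
            # "delete": the character is simply dropped
        else:
            out.append(ch)
    # Fall back to the original token if every character was deleted,
    # so the token/label alignment is never broken.
    return "".join(out) or token


def augment(dataset, noise_rate=0.1, seed=0):
    """Data augmentation: return a mixture of clean and noisy samples.

    Each (tokens, labels) pair is kept as-is and also added in a perturbed
    form. Labels are reused because the noise never changes the token count.
    """
    rng = random.Random(seed)
    mixed = []
    for tokens, labels in dataset:
        noisy = [perturb_token(t, noise_rate, rng) for t in tokens]
        mixed.append((tokens, labels))
        mixed.append((noisy, labels))
    return mixed


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete label distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def stability_objective(task_loss, clean_probs, noisy_probs, alpha=1.0):
    """Stability-style objective: task loss on the clean input plus a penalty
    for disagreement between clean and noisy predictions (assumed form)."""
    return task_loss + alpha * kl_divergence(clean_probs, noisy_probs)


if __name__ == "__main__":
    data = [(["Angela", "Merkel", "visited", "Paris"], ["B-PER", "I-PER", "O", "B-LOC"])]
    for tokens, labels in augment(data, noise_rate=0.2, seed=42):
        print(tokens, labels)
    print(stability_objective(0.7, [0.9, 0.1], [0.6, 0.4], alpha=0.5))
```

In this sketch, data augmentation only changes what the model sees during training, while the stability objective adds a term to the loss that grows as predictions on clean and perturbed input diverge. Both preserve the label sequence, which is why the noise function never removes or adds whole tokens.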

Code Implementations: 1 repo