CL Sep 4, 2021

Data Augmentation for Cross-Domain Named Entity Recognition

Shuguang Chen, Gustavo Aguilar, Leonardo Neves, Thamar Solorio

arXiv:2109.01758v130.9664 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of annotated data in low-resource domains for NER. The advance is incremental, since it builds on existing data augmentation techniques.

The paper tackles the problem of cross-domain data augmentation for named entity recognition by proposing a neural architecture that projects data from high-resource to low-resource domains, achieving significant improvements over using only high-resource data.

Current work in named entity recognition (NER) shows that data augmentation techniques can produce more robust models. However, most existing techniques focus on augmenting in-domain data in low-resource scenarios where annotated data is quite limited. In contrast, we study cross-domain data augmentation for the NER task. We investigate the possibility of leveraging data from high-resource domains by projecting it into the low-resource domains. Specifically, we propose a novel neural architecture to transform the data representation from a high-resource to a low-resource domain by learning the patterns in the text that differentiate them (e.g., style, noise, and abbreviations) and a shared feature space where both domains are aligned. We experiment with diverse datasets and show that transforming the data to the low-resource domain representation achieves significant improvements over only using data from high-resource domains.
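To make the idea of projecting labeled data from one domain into another concrete, the toy sketch below rewrites a clean, high-resource-style sentence into a noisier style while carrying the entity tags along unchanged. It is not the neural architecture proposed in the paper: the hand-written rules, the `ABBREVIATIONS` table, and the function name `project_to_noisy_domain` are all hypothetical stand-ins for the patterns a model would learn from data.

```python
from typing import List, Tuple

# (word, BIO tag) pairs, the usual token-level format for NER data.
Token = Tuple[str, str]

# Hypothetical style rules. In the paper these patterns are learned by a
# neural model; here they are hard-coded purely for illustration.
ABBREVIATIONS = {"you": "u", "tomorrow": "tmrw", "please": "pls"}


def project_to_noisy_domain(sentence: List[Token]) -> List[Token]:
    """Rewrite tokens in a noisier style while keeping every tag aligned."""
    projected = []
    for word, tag in sentence:
        lowered = word.lower()
        if tag == "O":
            # Non-entity tokens may also be abbreviated.
            new_word = ABBREVIATIONS.get(lowered, lowered)
        else:
            # Entity tokens are only lowercased, so the span stays intact.
            new_word = lowered
        projected.append((new_word, tag))
    return projected


source = [
    ("Please", "O"),
    ("meet", "O"),
    ("Barack", "B-PER"),
    ("Obama", "I-PER"),
    ("tomorrow", "O"),
]
augmented = project_to_noisy_domain(source)
print(augmented)
```

The point of the sketch is only that the transformation changes surface form and leaves the labels in place, so the projected sentences can be added to the low-resource training set without re-annotation.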
