CL · Apr 21, 2015

Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks

Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider, Timothy Baldwin

arXiv:1504.05319v25.65 citations

Originality: Synthesis-oriented

AI Analysis

This work provides insights for NLP practitioners on the practical utility and limitations of word embeddings in sequence labeling, though it is incremental as it compares existing methods without introducing new ones.

The paper evaluates five word embedding methods on four sequence labeling tasks. It finds that a few hundred training instances suffice for competitive results and that embeddings improve the handling of OOV words and out-of-domain data. Differences between the methods are minimal, and simple Brown clusters are often competitive.

Word embeddings -- distributed word representations that can be learned from unlabelled data -- have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of five popular word embedding methods in the context of four sequence labelling tasks: POS-tagging, syntactic chunking, NER and MWE identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements over OOV words and out of domain. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider.
