Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling
The paper addresses efficient and generalizable sequence labeling across multiple languages, but its contribution is incremental, since it builds on existing sparse coding and word embedding techniques.
The paper tackles multilingual sequence labeling using sparse indicator features derived from dense word embeddings, achieving near state-of-the-art performance for part-of-speech tagging and named entity recognition across languages. It retains over 89.8% of its average POS tagging accuracy when trained on only 1.2% of the total available training data, i.e., 150 sentences per language.
In this paper we propose and carefully evaluate a sequence labeling framework which solely utilizes sparse indicator features derived from dense distributed word representations. The proposed model obtains (near) state-of-the-art performance for both part-of-speech tagging and named entity recognition for a variety of languages. Our model relies only on a few thousand sparse coding-derived features, without applying any modification of the word representations employed for the different tasks. The proposed model has favorable generalization properties as it retains over 89.8% of its average POS tagging accuracy when trained on 1.2% of the total available training data, i.e., 150 sentences per language.
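
To make the idea of "sparse indicator features derived from dense word representations" concrete, the following is a minimal sketch, not the paper's implementation. It assumes an L1-regularized least-squares formulation of sparse coding, solves it with ISTA (a generic solver chosen here for brevity), and uses a tiny hand-written dictionary in place of a learned one. The function names and the `P`/`N` feature naming are hypothetical.

```python
from typing import List

Vector = List[float]


def soft_threshold(x: float, t: float) -> float:
    """Shrink x toward zero by t (proximal operator of the L1 norm)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sparse_code(x: Vector, atoms: List[Vector], lam: float = 0.1,
                step: float = 0.1, iters: int = 200) -> Vector:
    """Approximately minimize 0.5*||x - sum_k a[k]*atoms[k]||^2 + lam*||a||_1 via ISTA.

    Assumption: `step` is small enough for the given dictionary
    (at most 1 / largest eigenvalue of the atoms' Gram matrix).
    """
    n_atoms, dim = len(atoms), len(x)
    a = [0.0] * n_atoms
    for _ in range(iters):
        # Residual of the current reconstruction.
        recon = [sum(a[k] * atoms[k][d] for k in range(n_atoms)) for d in range(dim)]
        resid = [recon[d] - x[d] for d in range(dim)]
        # Gradient step on the squared error, followed by soft-thresholding.
        grad = [sum(resid[d] * atoms[k][d] for d in range(dim)) for k in range(n_atoms)]
        a = [soft_threshold(a[k] - step * grad[k], step * lam) for k in range(n_atoms)]
    return a


def indicator_features(code: Vector, eps: float = 1e-8) -> List[str]:
    """One binary feature per nonzero coefficient, keeping only index and sign."""
    return [f"{'P' if c > 0 else 'N'}{k}" for k, c in enumerate(code) if abs(c) > eps]


# Toy example: a 3-dimensional "embedding" and an orthonormal 3-atom dictionary.
# In practice the dictionary would be learned from the full embedding matrix.
toy_atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_embedding = [0.8, -0.5, 0.05]
toy_code = sparse_code(toy_embedding, toy_atoms)
print(indicator_features(toy_code))  # ['P0', 'N1']
```

In this sketch the small third coordinate is thresholded to exactly zero, so the word is described by two indicator features instead of three real values. A sequence labeler could then consume such indicators for the current word and its neighbors as ordinary sparse features.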