CL · Nov 21, 2018

The Best of Both Worlds: Lexical Resources To Improve Low-Resource Part-of-Speech Tagging

Barbara Plank, Sigrid Klerke, Zeljko Agic

arXiv:1811.08757v1 · 0.75 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of improving part-of-speech tagging for low-resource languages by leveraging available lexical information, representing an incremental analysis rather than a novel method.

The paper tackles the problem of low-resource part-of-speech tagging by analyzing how integrating lexical resources improves neural cross-lingual methods, finding that such integration yields benefits, though the extent depends on resource coverage and quality.

In natural language processing, the deep learning revolution has shifted the focus from conventional hand-crafted symbolic representations to dense inputs, which are adequate representations learned automatically from corpora. However, particularly when working with low-resource languages, small amounts of symbolic lexical resources such as user-generated lexicons are often available even when gold-standard corpora are not. Such additional linguistic information is nevertheless often neglected, and recent neural approaches to cross-lingual tagging typically rely only on word and subword embeddings. While these representations are effective, our recent work has shown clear benefits of combining the best of both worlds: integrating conventional lexical information improves neural cross-lingual part-of-speech (PoS) tagging. However, little is known about how complementary such additional information is, and to what extent improvements depend on the coverage and quality of these external resources. This paper seeks to fill this gap by providing the first thorough analysis of the contributions of lexical resources for cross-lingual PoS tagging in neural times.
