CL · Jul 21, 2017

Mimicking Word Embeddings using Subword RNNs

arXiv:1707.06961v1 · 146 citations · Has Code
Originality: Incremental advance
AI Analysis

MIMICK addresses the out-of-vocabulary (OOV) problem in NLP tasks and is particularly useful in low-resource settings, though it is an incremental advance that builds on existing word embedding methods.

The paper tackles the problem of out-of-vocabulary (OOV) words in NLP by proposing MIMICK, an approach that generates embeddings for OOV words compositionally from spellings without re-training on the original corpus, resulting in improved performance for part-of-speech and morphosyntactic tagging across 23 languages.

Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings.
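The type-level idea in the abstract, learning a function from spellings to pre-trained embeddings using only the vocabulary list, can be sketched in a few lines. This is a minimal illustration and not the paper's model: it swaps the character-level RNN for a much simpler mean of character trigram vectors, and it assumes a squared-error objective. The names `char_ngrams`, `predict`, and `train_mimic`, and the toy two-dimensional embeddings, are all hypothetical.

```python
import random

def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def predict(weights, word, dim):
    """Embed a word as the mean of its known n-gram vectors (zeros if none are known)."""
    grams = [g for g in char_ngrams(word) if g in weights]
    if not grams:
        return [0.0] * dim
    total = [0.0] * dim
    for g in grams:
        for k in range(dim):
            total[k] += weights[g][k]
    return [v / len(grams) for v in total]

def train_mimic(embeddings, dim, epochs=200, lr=0.5, seed=0):
    """Fit n-gram vectors by SGD so predicted vectors match the pre-trained ones.

    Training is at the type level: each vocabulary word is one example,
    and the original corpus is never consulted.
    """
    rng = random.Random(seed)
    weights = {}
    vocab = sorted(embeddings)
    for _ in range(epochs):
        rng.shuffle(vocab)
        for word in vocab:
            grams = char_ngrams(word)
            for g in grams:
                weights.setdefault(g, [0.0] * dim)
            pred = predict(weights, word, dim)
            target = embeddings[word]
            # Gradient step on 0.5 * ||pred - target||^2 (assumed loss).
            for g in grams:
                for k in range(dim):
                    weights[g][k] -= lr * (pred[k] - target[k]) / len(grams)
    return weights

# Toy pre-trained embeddings (made up for illustration only).
toy_embeddings = {
    "cat": [1.0, 0.0],
    "cats": [1.0, 1.0],
    "dog": [0.0, 1.0],
}
weights = train_mimic(toy_embeddings, dim=2)

# "dogs" is out of vocabulary, but it shares trigrams with "dog".
oov_vector = predict(weights, "dogs", dim=2)
print(oov_vector)
```

In this toy setup the OOV word "dogs" receives a vector built from the trigrams it shares with "dog", while a word with no known trigrams falls back to zeros. A recurrent model over characters, as in the paper, can capture order and longer-range spelling patterns that a bag of trigrams cannot.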

Code Implementations: 2 repos
