Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
The work addresses the brittleness of NLP systems when users' inputs contain OOV words, though the contribution is incremental because it builds on existing mimick-like models.
The paper tackles the problem of Out-of-Vocabulary (OOV) words making NLP systems brittle by proposing LOVE, a simple contrastive learning framework that generates embeddings for unseen words using only surface forms. The lightweight model achieves similar or better performance than prior methods on original and corrupted datasets, and improves robustness when used with FastText and BERT.
State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models to generate vectors for unseen words, by learning the behavior of pre-trained embeddings using only the surface form of words. We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT) and makes it robust to OOV words with few additional parameters. Extensive evaluations demonstrate that our lightweight model achieves similar or even better performance than prior competitors, both on original datasets and on corrupted variants. Moreover, it can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness.
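The abstract does not spell out the training objective, so the following is only a minimal sketch of how a mimick-like model could be trained contrastively. It assumes an InfoNCE-style loss, a common contrastive objective that may differ from the one LOVE actually uses. Each vector predicted from a word's surface form is pulled toward that word's pre-trained embedding and pushed away from the embeddings of the other words in the batch. The function names `cosine` and `info_nce`, the temperature value, and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(predicted, targets, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, for illustration).

    predicted[i] is the vector generated from the surface form of word i;
    targets[i] is the pre-trained embedding of the same word. The other
    targets in the batch act as negatives.
    """
    losses = []
    for i, p in enumerate(predicted):
        logits = [cosine(p, t) / temperature for t in targets]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_sum = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Negative log-probability of picking the matching target.
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two words with 2-dimensional "pre-trained" embeddings.
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # predictions match their targets
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # predictions match the wrong targets

loss_aligned = info_nce(aligned, targets)
loss_misaligned = info_nce(misaligned, targets)
```

In this toy batch the loss is close to zero when each predicted vector matches its own target and is large when the predictions are swapped, which is the behavior a surface-form encoder would be trained to exploit.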