How to Evaluate Word Representations of Informal Domain?
This work addresses a bottleneck for NLP researchers and practitioners working with informal text, though the contribution is incremental because it builds on existing evaluation methods.
The paper tackles the challenge of evaluating word embeddings in informal domains such as Twitter by automatically extracting variant spelling pairs from UrbanDictionary with weakly-supervised pattern-based bootstrapping and a self-training CRF, with the aim of enabling direct use of non-standard word representations without text normalization.
Diverse word representations are now used in most state-of-the-art natural language processing (NLP) applications. Nevertheless, how to efficiently evaluate such word embeddings in informal domains such as Twitter or forums remains an ongoing challenge because sufficient evaluation datasets are lacking. We derived a large list of variant spelling pairs from UrbanDictionary using two automatic approaches: weakly-supervised pattern-based bootstrapping and a self-training linear-chain conditional random field (CRF). With these extracted relation pairs, we make it more feasible to skip the text normalization step of traditional NLP pipelines and to directly adopt representations of non-standard words in the informal domain. Our code is available.
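To make the pattern-based bootstrapping idea concrete, the following is a minimal sketch under stated assumptions, not the authors' released implementation. The toy `ENTRIES` list, the seed pattern, and the helper names `apply_patterns`, `induce_patterns`, and `bootstrap` are all hypothetical and were invented for illustration. The sketch assumes each definition has the form "<pattern> <standard word>", which is far simpler than real UrbanDictionary text. The self-training CRF step is not shown.

```python
import re
from collections import Counter

# Toy (headword, definition) entries. These are invented for illustration
# and are NOT taken from UrbanDictionary or from the paper's data.
ENTRIES = [
    ("tmrw", "short for tomorrow"),
    ("2moro", "short for tomorrow"),
    ("gr8", "another way to spell great"),
    ("luv", "another way to spell love"),
    ("wut", "misspelling of what"),
    ("luv", "misspelling of love"),
    ("cat", "a small furry animal"),
]


def apply_patterns(patterns, entries):
    """Extract (variant, standard) pairs from definitions matching a pattern."""
    pairs = set()
    for head, definition in entries:
        for pat in patterns:
            # Assumed definition shape: "<pattern> <single standard word>".
            m = re.fullmatch(re.escape(pat) + r" (\w+)", definition)
            if m:
                pairs.add((head, m.group(1)))
    return pairs


def induce_patterns(pairs, entries, min_support=1):
    """Propose new patterns from definitions that mention an already known pair."""
    counts = Counter()
    for head, definition in entries:
        for variant, standard in pairs:
            suffix = " " + standard
            if head == variant and definition.endswith(suffix):
                # The text preceding the standard word becomes a candidate pattern.
                counts[definition[: -len(suffix)]] += 1
    return {p for p, c in counts.items() if c >= min_support}


def bootstrap(seed_patterns, entries, rounds=3):
    """Alternate pair extraction and pattern induction until nothing new appears."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        new_pairs = apply_patterns(patterns, entries)
        new_patterns = induce_patterns(new_pairs, entries)
        if new_pairs <= pairs and new_patterns <= patterns:
            break
        pairs |= new_pairs
        patterns |= new_patterns
    return pairs, patterns


if __name__ == "__main__":
    found_pairs, found_patterns = bootstrap({"another way to spell"}, ENTRIES)
    print(sorted(found_pairs))
    print(sorted(found_patterns))
```

Starting from the single seed pattern, the toy run learns "misspelling of" because the headword "luv" appears under both patterns, and it then picks up the pair ("wut", "what"). The "short for" entries are never reached because no known pair links to them, which illustrates why a bootstrapped extractor of this kind would likely need a complementary model to improve coverage.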