CL AI Jul 24, 2017

Learning Rare Word Representations using Semantic Bridging

Victor Prokhorov, Mohammad Taher Pilehvar, Dimitri Kartsaklis, Pietro Lió, Nigel Collier

arXiv:1707.07554v1 0.71 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses the limited vocabulary coverage of word embeddings in natural language processing tasks, though the contribution is incremental because it adapts existing methods.

The paper tackles the problem of extending word embedding coverage to rare and unseen words by merging corpus and ontological knowledge using adapted graph embedding and cross-lingual mapping techniques, achieving over 99% extra coverage and around 10% absolute performance gain on the Rare Word Similarity dataset.

We propose a methodology that adapts graph embedding techniques (DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016)) as well as cross-lingual vector space mapping approaches (Least Squares and Canonical Correlation Analysis) in order to merge the corpus and ontological sources of lexical knowledge. We also perform a comparative analysis of the algorithms used in order to identify the best combination for the proposed system. We then apply this to the task of enhancing the coverage of an existing word embedding's vocabulary with rare and unseen words. We show that our technique can provide considerable extra coverage (over 99%), leading to a consistent performance gain (around 10% absolute gain is achieved with w2v-gn-500K, cf. §3.3) on the Rare Word Similarity dataset.
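The Least Squares mapping step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that words present in both the ontology-derived graph embedding and the corpus embedding serve as anchor pairs, and that a single linear map is fitted from the graph space to the corpus space. The function name, the toy dimensions, and the synthetic data are all hypothetical and are not taken from the paper; the paper's actual pipeline, including the DeepWalk/node2vec training and the CCA variant, is not reproduced here.

```python
import numpy as np


def fit_least_squares_map(source, target):
    """Return W minimizing ||source @ W - target||_F (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(source, target, rcond=None)
    return W


rng = np.random.default_rng(0)
graph_dim, corpus_dim, n_shared = 4, 3, 50  # toy sizes, assumptions only

# Synthetic stand-ins: rows are words that exist in BOTH spaces.
true_map = rng.normal(size=(graph_dim, corpus_dim))
shared_graph = rng.normal(size=(n_shared, graph_dim))
shared_corpus = shared_graph @ true_map + 0.01 * rng.normal(size=(n_shared, corpus_dim))

# Fit the linear map on the shared words.
W = fit_least_squares_map(shared_graph, shared_corpus)

# A rare word that has only a graph embedding is projected into the corpus space.
rare_graph_vec = rng.normal(size=(graph_dim,))
rare_corpus_vec = rare_graph_vec @ W
```

Under these assumptions, `rare_corpus_vec` lives in the same space as the existing corpus embeddings, so it can be compared with them by cosine similarity, which is how a similarity benchmark such as Rare Word Similarity would consume it.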
