CL · LG · Oct 21, 2020

PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding

arXiv:2010.10813v1993 citations
Originality Highly original
AI Analysis

This paper addresses the challenge of handling unknown words in NLP tasks, offering a method that outperforms previous subword-level models without requiring explicit morphological knowledge.

The paper tackles the problem of generalizing word embeddings to out-of-vocabulary words using only spellings, proposing a probabilistic bag-of-subwords model that improves embedding quality across languages in word similarity and POS tagging experiments.

We study the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words without extra contextual information. We rely solely on the spellings of words and propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embeddings. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords to all possible segmentations according to their likelihood. Inspection and an affix prediction experiment show that PBoS produces meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the quality of generated word embeddings across languages.
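To make the idea of "bag-of-subwords over all possible segmentations, weighted by likelihood" concrete, here is a minimal toy sketch. It is not the paper's implementation: it assumes the likelihood of a segmentation is the product of independent subword probabilities, it enumerates segmentations by brute force (exponential in word length, unlike the efficient algorithm the abstract mentions), and the names `segmentations`, `pbos_embed`, `subword_prob`, and `subword_vec`, as well as all numeric values, are hypothetical.

```python
from itertools import combinations


def segmentations(word):
    """Yield every split of `word` into contiguous, non-empty pieces."""
    n = len(word)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


def pbos_embed(word, subword_prob, subword_vec, dim):
    """Likelihood-weighted average of bag-of-subwords vectors.

    Assumption (for illustration only): a segmentation's likelihood is the
    product of its subwords' probabilities; segmentations containing a
    subword outside the table are skipped.
    """
    weighted = [0.0] * dim
    total = 0.0
    for seg in segmentations(word):
        if any(s not in subword_prob for s in seg):
            continue
        likelihood = 1.0
        for s in seg:
            likelihood *= subword_prob[s]
        total += likelihood
        # Bag-of-subwords: add each subword vector, scaled by the likelihood.
        for s in seg:
            for i, x in enumerate(subword_vec[s]):
                weighted[i] += likelihood * x
    if total == 0.0:
        return None  # no segmentation is covered by the subword table
    return [x / total for x in weighted]


# Hypothetical toy tables, chosen so the arithmetic is easy to follow.
subword_prob = {"undo": 0.25, "un": 0.5, "do": 0.5}
subword_vec = {"undo": [1.0, 0.0], "un": [0.0, 1.0], "do": [0.0, 1.0]}

vec = pbos_embed("undo", subword_prob, subword_vec, dim=2)
print(vec)
```

In this toy setup only two of the eight segmentations of "undo" are covered by the table, ["undo"] and ["un", "do"], and both have likelihood 0.25, so each contributes half of the final vector.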

Code Implementations: 1 repo