CL · Jun 4, 2021

Modeling the Unigram Distribution

arXiv:2106.02289v1715 citations
Originality: Incremental advance
AI Analysis

The paper addresses a fundamental issue in natural language processing, relevant to both researchers and practitioners: obtaining better probability estimates for word forms. The advance is incremental, since the model builds on prior work.

The paper tackles the problem of estimating the unigram distribution of a language. This distribution is often approximated by sample frequencies, which biases the estimates: out-of-vocabulary words receive zero probability. The paper presents a novel neural model that produces much better estimates across 7 languages than naive neural character-level language models.

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (OOV) word form. As a result, it produces negatively biased probabilities for any OOV word form and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show it produces much better estimates across a diverse set of 7 languages than the naïve use of neural character-level language models.
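The sample-frequency approximation described above can be illustrated with a minimal sketch. It is not the paper's neural model. The toy corpus and the function name `sample_frequency_unigram` are assumptions made only for illustration. The sketch computes p(w) = count(w) / N and shows that a word form absent from the sample gets an estimate of exactly zero.

```python
from collections import Counter


def sample_frequency_unigram(tokens):
    """Sample-frequency estimate of the unigram distribution: p(w) = count(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus, not data from the paper.
corpus = "the cat sat on the mat".split()
probs = sample_frequency_unigram(corpus)

p_the = probs["the"]              # "the" occurs in 2 of the 6 tokens
p_dog = probs.get("dog", 0.0)     # "dog" was never observed, so its estimate is 0
total_mass = sum(probs.values())  # all probability mass sits on in-corpus words
```

Because the in-corpus words share the whole probability mass, their estimates are inflated by whatever mass the unseen word forms should have received. This is the bias the abstract describes.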

Code Implementations: 1 repo