LG · CL · ML · Sep 16, 2018

Curriculum-Based Neighborhood Sampling For Sequence Prediction

arXiv:1809.05916v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses a known bottleneck in sequence prediction for language modeling, offering an incremental improvement over existing methods like scheduled sampling.

The paper tackles exposure bias in language models by proposing a curriculum learning method that gradually introduces stochasticity into the teacher policy, using nearest-neighbor replacement sampling to explore alternatives and reduce compounding errors. The approach performs well when combined with scheduled sampling on two language modeling benchmarks, though specific numerical gains are not detailed.

Multi-step-ahead prediction in language models is challenging because of the discrepancy between training and testing. At test time, a language model must make predictions given its own past predictions as input, instead of the past targets that are provided during training. This difference, known as exposure bias, can lead to the compounding of errors along a generated sequence at test time. To improve generalization in neural language models and address compounding errors, we propose a curriculum-learning-based method that gradually changes an initially deterministic teacher policy into an increasingly stochastic one, which we refer to as Nearest-Neighbor Replacement Sampling. A chosen input at a given timestep is replaced with a sampled nearest neighbor of the past target, with a truncated probability proportional to the cosine similarity between the original word and its top $k$ most similar words. This allows the teacher to explore alternatives when the teacher provides a sub-optimal policy or when the initial policy is difficult for the learner to model. The proposed strategy is straightforward, online, and requires little additional memory. We report our main findings on two language modelling benchmarks and find that the proposed approach performs particularly well when used in conjunction with scheduled sampling, which also attempts to mitigate compounding errors in language models.
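
The replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "truncated" means restricting the distribution to the top $k$ neighbors and renormalizing, that negative similarities are clipped to zero, and that the curriculum is a linear ramp of the replacement rate. The function names (`top_k_neighbors`, `replacement_rate`, `nn_replacement_sample`) and the `max_rate` parameter are hypothetical.

```python
import numpy as np


def top_k_neighbors(embeddings, word_id, k):
    """Return ids and cosine similarities of the k words most similar to word_id."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = unit @ unit[word_id]
    sims[word_id] = -np.inf  # exclude the word itself from its own neighbors
    ids = np.argsort(-sims)[:k]
    return ids, sims[ids]


def replacement_rate(step, total_steps, max_rate=0.5):
    """Assumed linear curriculum: 0 at the start (deterministic teacher),
    rising to max_rate by the end of training."""
    return max_rate * min(1.0, step / total_steps)


def nn_replacement_sample(targets, embeddings, k, rate, rng):
    """With probability `rate`, replace each past target by one of its top-k
    neighbors, sampled proportionally to cosine similarity."""
    inputs = np.array(targets, copy=True)
    for t, w in enumerate(targets):
        if rng.random() < rate:
            ids, sims = top_k_neighbors(embeddings, w, k)
            weights = np.clip(sims, 0.0, None)  # assumption: drop negative similarities
            if weights.sum() > 0:
                inputs[t] = rng.choice(ids, p=weights / weights.sum())
    return inputs


# Toy usage: a random 20-word vocabulary with positive 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb = np.abs(rng.normal(size=(20, 8))) + 0.1
targets = np.array([3, 7, 7, 12, 0])
rate = replacement_rate(step=50, total_steps=100)  # halfway through the ramp
noisy_inputs = nn_replacement_sample(targets, emb, k=3, rate=rate, rng=rng)
```

Under these assumptions the teacher starts by feeding the ground-truth targets unchanged and perturbs them more often as training proceeds. The sketch recomputes similarities on every call for brevity. Precomputing the top-$k$ table once would be the cheaper choice in practice.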
