Adaptive Semiparametric Language Models
This work aims to improve language model performance by adaptively drawing on multiple memory sources. It is an incremental advance in language modeling architectures, relevant to both researchers and practitioners.
This paper introduces a language model that integrates a parametric neural network with a non-parametric episodic memory. The model adaptively combines local context, short-term memory, and long-term memory to make predictions, showing efficacy on word-based and character-based language modeling datasets compared to strong baselines.
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states, similar to Transformer-XL, and global long-term memory by retrieving a set of nearest-neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use local context, short-term memory, long-term memory, or any combination of them, depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
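To make the idea of adaptively combining information sources concrete, the following is a minimal sketch of one possible gate. It is not the formulation used in this work, which the abstract does not specify. It assumes each source (local context, short-term memory, long-term memory) has already been summarized as a vector of the same size, and that each source has a scoring vector. The names `gated_combine` and `gate_weights`, and the choice of a softmax over per-source scores, are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def gated_combine(
    sources: Sequence[Sequence[float]],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Mix equal-length source vectors with context-dependent gates.

    sources:      e.g. [local, short_term, long_term] (assumed layout).
    gate_weights: one hypothetical scoring vector per source.
    Returns (gates, combined), where gates sum to 1.
    """
    # Score each source from its own representation, then normalize.
    scores = [dot(w, h) for w, h in zip(gate_weights, sources)]
    gates = softmax(scores)
    dim = len(sources[0])
    # Convex combination of the source vectors, one dimension at a time.
    combined = [
        sum(g * h[i] for g, h in zip(gates, sources)) for i in range(dim)
    ]
    return gates, combined


if __name__ == "__main__":
    local = [1.0, 0.0]
    short_term = [0.0, 1.0]
    long_term = [1.0, 1.0]
    # Zero scoring vectors give every source the same gate value.
    weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    gates, mixed = gated_combine([local, short_term, long_term], weights)
    print(gates, mixed)
```

Because the gates are recomputed from the current representations at every timestep, such a gate can put nearly all of its weight on one source or spread it across several, which is the behavior the abstract describes.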