Reconsidering Degeneration of Token Embeddings with Definitions for Encoder-based Pre-trained Language Models
This work addresses embedding degeneration for low-frequency tokens in NLP models, offering an improvement targeted at encoder-based pre-trained language models (PLMs).
The study tackles token embedding degeneration in encoder-based pre-trained language models by proposing DefinitionEMB, a method that uses definitions to reconstruct isotropic, semantics-related embeddings. The reconstructed embeddings improve performance on GLUE and text summarization datasets.
Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out that the distribution of learned embeddings degenerates into anisotropy (i.e., non-uniform distribution), and even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes the fine-tuning dynamics of encoder-based PLMs and demonstrates their robustness against degeneration. On the basis of this analysis, we propose DefinitionEMB, a method that utilizes definitions to reconstruct isotropically distributed and semantics-related token embeddings for encoder-based PLMs while maintaining their original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to reconstruct such embeddings for two encoder-based PLMs: RoBERTa-base and BART-large. Furthermore, the reconstructed embeddings for low-frequency tokens improve the performance of these models across various GLUE tasks and four text summarization datasets.
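To make the notion of anisotropy concrete, the sketch below computes the average pairwise cosine similarity of a set of vectors, a common proxy for how non-uniformly embeddings are spread. It is an illustration only and is not taken from this study: the function name `mean_pairwise_cosine`, the synthetic vectors, and the offset of 5.0 are all assumptions, and the study may measure isotropy differently.

```python
import math
import random


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors.

    Values near 0 suggest directions are spread uniformly (isotropic);
    values near 1 suggest the vectors crowd into a narrow cone (anisotropic).
    """
    # Scale every vector to unit length so dot products equal cosines.
    units = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        units.append([x / norm for x in v])

    total, count = 0.0, 0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            total += sum(a * b for a, b in zip(units[i], units[j]))
            count += 1
    return total / count


# Hypothetical data: zero-mean Gaussian vectors stand in for isotropic embeddings.
rng = random.Random(0)
dim, n = 32, 100
isotropic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Adding the same offset to every coordinate pushes all vectors toward one
# shared direction, mimicking a degenerated (anisotropic) embedding space.
anisotropic = [[x + 5.0 for x in v] for v in isotropic]

iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(anisotropic)
print(f"isotropic:   {iso_score:.3f}")
print(f"anisotropic: {aniso_score:.3f}")
```

In this toy setting the first score stays close to zero while the second approaches one. Applying the same function to rows of a real embedding matrix would give a rough indication of how strongly that matrix has degenerated.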