CL · Aug 2, 2024

Reconsidering Degeneration of Token Embeddings with Definitions for Encoder-based Pre-trained Language Models

arXiv:2408.01308v2 · 2 citations · h-index: 11
AI Analysis

This work addresses embedding degeneration for low-frequency tokens in NLP models, offering a domain-specific improvement for encoder-based PLMs.

The study addresses token embedding degeneration in encoder-based pre-trained language models (PLMs). It proposes DefinitionEMB, a method that uses definitions to reconstruct isotropic, semantics-related embeddings, and reports improved performance on GLUE and on text summarization datasets.

Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out that the distribution of learned embeddings degenerates into anisotropy (i.e., non-uniform distribution), and even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes the fine-tuning dynamics of encoder-based PLMs and demonstrates their robustness against degeneration. On the basis of this analysis, we propose DefinitionEMB, a method that utilizes definitions to re-construct isotropically distributed and semantics-related token embeddings for encoder-based PLMs while maintaining original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to re-construct such embeddings for two encoder-based PLMs: RoBERTa-base and BART-large. Furthermore, the re-constructed embeddings for low-frequency tokens improve the performance of these models across various GLUE and four text summarization datasets.
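
To make the notion of anisotropy in the abstract concrete, here is a minimal sketch of one common proxy: the mean pairwise cosine similarity of a set of embeddings, where values near 0 suggest a roughly isotropic distribution and values near 1 suggest vectors crowded into a narrow cone. The metric choice, the function name `mean_pairwise_cosine`, and the synthetic data are assumptions for illustration only; this is not necessarily the measure used in the paper, and no real model embeddings are involved.

```python
import math
import random


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of vectors."""
    # Scale every vector to unit length so a dot product equals a cosine.
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])

    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            total += sum(a * b for a, b in zip(unit[i], unit[j]))
            pairs += 1
    return total / pairs


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim, n = 16, 200        # hypothetical embedding size and vocabulary size

# Roughly isotropic toy embeddings: zero-mean Gaussian vectors.
isotropic = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Anisotropic toy embeddings: the same vectors shifted by a shared offset,
# which pushes all of them toward one common direction.
anisotropic = [[x + 3.0 for x in vec] for vec in isotropic]

iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(anisotropic)
print(f"isotropic:   {iso_score:.3f}")
print(f"anisotropic: {aniso_score:.3f}")
```

In this toy setup the shifted set scores far higher than the zero-mean set, which is the kind of gap such a proxy is meant to expose. To apply it to a real model, the rows of its token embedding matrix would take the place of the synthetic lists.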
