IT · May 8

Semantic Smoothing for Language Models via Distribution Estimation and Embeddings

Haricharan Balasundaram, Swathi Shree Narashiman, Pranay Mathur, Andrew Thangaraj

arXiv:2605.07994 · 14.2

Predicted impact: top 76% in IT · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses the problem of data sparsity in language model smoothing by incorporating semantic similarity, offering a principled theoretical framework and practical improvements.

The paper introduces semantic smoothing for language models, which uses embeddings to share statistical information across semantically similar contexts, and reports consistent reductions in test perplexity on synthetic Markov data and WikiText-103 bigram models when applied to add-constant and Kneser-Ney estimates.

We propose semantic smoothing, a smoothing method for language models that uses embeddings to share statistical observations across semantically similar contexts. The starting point is a decomposition of log-perplexity that motivates smoothing as a collection of distribution-estimation problems under Kullback-Leibler (KL) loss. We then show that, under a Lipschitz-logit model for embedding-based language generation, proximity of context embeddings implies proximity of the corresponding next-word distributions in KL divergence. Combining these observations, we formulate semantic smoothing as distribution estimation in KL loss with KL-proximity side information. For $n$ samples on a $d$-symbol alphabet with a side-information distribution at KL distance $Δ$, we give an interpolation estimator with worst-case KL risk $O(\min\{Δ,d/n\})$, and prove a matching-order lower bound for uniform side information. We extend the estimator to multiple and empirically estimated synonymous distributions. Experiments on synthetic Markov data and WikiText-103 bigram models using Word2Vec, GloVe, and GPT-2 embeddings show that semantic smoothing consistently reduces test perplexity when applied to add-constant and Kneser-Ney estimates.
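To make the interpolation idea concrete, here is a minimal sketch in plain Python. It blends an add-constant estimate built from the observed counts with a side-information distribution that is assumed to lie within KL distance `delta` of the truth. The function names and, in particular, the weighting rule in `side_info_weight` are illustrative assumptions and are not taken from the paper; the rule is only chosen so that the side information dominates when `delta` is small relative to `d/n` and fades as `n` grows, in the spirit of the $O(\min\{Δ,d/n\})$ risk.

```python
import math


def add_constant_estimate(counts, beta=0.5):
    """Add-constant (add-beta) estimate from raw symbol counts."""
    total = sum(counts) + beta * len(counts)
    return [(c + beta) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def side_info_weight(n, d, delta):
    """Hypothetical weighting rule (an assumption, not the paper's).

    Returns a weight in [0, 1] on the side-information distribution:
    close to 1 when delta is much smaller than d/n, close to 0 when
    d/n is much smaller than delta.
    """
    if n == 0:
        return 1.0  # no samples: rely entirely on side information
    ratio = d / n
    return ratio / (ratio + delta)


def semantic_interpolate(counts, side, delta, beta=0.5):
    """Convex combination of the add-constant estimate and side info."""
    n, d = sum(counts), len(counts)
    lam = side_info_weight(n, d, delta)
    base = add_constant_estimate(counts, beta)
    return [(1 - lam) * b + lam * s for b, s in zip(base, side)]


if __name__ == "__main__":
    counts = [3, 0, 1, 0]            # sparse observations for one context
    side = [0.5, 0.1, 0.3, 0.1]      # distribution from a similar context
    estimate = semantic_interpolate(counts, side, delta=0.05)
    print(estimate)
    print(kl_divergence(side, estimate))
```

Because both inputs to the blend are valid distributions with full support, the result is also a valid distribution with no zero entries, so its KL loss against any true distribution stays finite.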
