CL | May 21, 2025

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

arXiv:2505.15045v1 | 20 citations | h-index: 32 | EMNLP
Originality: Highly original
AI Analysis

This paper addresses a fundamental limitation of text embedding models for tasks such as document retrieval, offering a novel approach that improves performance on specific benchmarks.

The paper tackles the misalignment between unidirectional attention in autoregressive language models and the bidirectional nature of text embedding tasks by proposing diffusion language models for embeddings, resulting in performance gains of 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval.

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
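
The mismatch between unidirectional and bidirectional attention described in the abstract can be illustrated with a toy sketch. This is not the paper's model: it is a single attention head in plain Python that assumes queries, keys, and values all equal the raw token vectors, and the names `attend`, `doc_a`, and `doc_b` are hypothetical. It only shows that, under a causal mask, an early token's state cannot change when a later token changes, whereas under full attention it can.

```python
import math

def attend(x, causal):
    """Toy single-head self-attention with Q = K = V = x (no learned weights)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Causal mask: position i sees only positions 0..i.
        # Bidirectional: position i sees every position.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in visible
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * x[j][k] for w, j in zip(weights, visible))
            for k in range(d)
        ])
    return out

# Two toy "documents" that differ only in their last token.
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

causal_a, causal_b = attend(doc_a, causal=True), attend(doc_b, causal=True)
bidir_a, bidir_b = attend(doc_a, causal=False), attend(doc_b, causal=False)

# Does the first token's state stay the same when the last token changes?
first_token_same_causal = causal_a[0] == causal_b[0]
first_token_same_bidir = bidir_a[0] == bidir_b[0]

print("causal, first token unchanged:", first_token_same_causal)
print("bidirectional, first token unchanged:", first_token_same_bidir)
```

In this sketch only the bidirectional variant lets the first token's representation reflect the end of the document, which is one way to picture the abstract's claim that bidirectional attention helps encode global context.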

