LG, CL, IR · Feb 11

Diffusion-Pretrained Dense and Contextual Embeddings

arXiv:2602.11151v1 · 4 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work targets retrieval quality and efficiency in large-scale, real-world search. It is an incremental advance that combines existing methods in a new way.

The authors introduce pplx-embed, a family of multilingual embedding models that apply multi-stage contrastive learning to a diffusion-pretrained language model for web-scale retrieval. The models are competitive on several benchmarks, including MTEB, and set new records on ConTEB.

In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.
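The abstract's combination of mean pooling and late chunking can be illustrated with a minimal sketch. It assumes the whole document has already been encoded once by a bidirectional model, so each token vector already reflects context from the full document, and each chunk embedding is then the mean of the token vectors inside that chunk's span. The names `mean_pool`, `late_chunk`, `doc_tokens`, and `chunk_spans` are hypothetical, the token vectors are toy values rather than real model output, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length token vectors."""
    if not token_vectors:
        raise ValueError("cannot pool an empty span")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        for i, x in enumerate(vec):
            sums[i] += x
    return [s / len(token_vectors) for s in sums]


def late_chunk(
    token_vectors: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool each [start, end) token span of an already-encoded document.

    Chunking happens after encoding, so every pooled vector is built from
    token vectors that were computed with the full document in view.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy stand-in for encoder output (assumption): 4 tokens, 2 dimensions.
doc_tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
# Two chunks covering tokens 0-1 and 2-3 (hypothetical boundaries).
chunk_spans = [(0, 2), (2, 4)]

chunk_embeddings = late_chunk(doc_tokens, chunk_spans)
print(chunk_embeddings)  # [[2.0, 1.0], [1.0, 3.0]]
```

By contrast, chunking before encoding would pass each chunk through the model separately, so its token vectors could not draw on the rest of the document.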

