CL · May 22

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

arXiv:2605.23618 · 4.3 · h-index: 25 · Has Code
Predicted impact top 63% in CL · last 90 days · Originality: Synthesis-oriented
AI Analysis

Provides a practical benchmark for practitioners choosing between proprietary and open-source multilingual dense retrievers, with concrete latency-accuracy trade-offs.

Google Embeddings 2 (GE2) achieves the top retrieval accuracy (BEIR avg. nDCG@10 = 0.638) but, at 231.6 ms median latency, is roughly 14x slower than the fastest local models. Multilingual-E5-large comes within 0.003 nDCG of GE2 on the Italian RAG benchmark at 31 ms latency, while LaBSE performs poorly (0.188 avg. nDCG@10 on BEIR).

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with a 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus (IT-RAG-Bench), a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model, including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
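
For readers unfamiliar with the headline metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes the linear-gain form of DCG and graded relevance judgments stored in a plain dictionary; the function names and the `qrels` layout are illustrative and are not taken from the paper's code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: hypothetical dict mapping doc id -> relevance grade (0 = not relevant).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the highest-graded documents first.
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # One relevant document, retrieved at rank 2.
    print(ndcg_at_k(["d7", "d3", "d9"], {"d3": 1}))
```

A benchmark-level figure such as the reported BEIR average would then be the mean of this per-query score over all queries in a subset, averaged again across subsets.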

