CL · May 22

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Giandomenico Solimando

arXiv:2605.23618 · 4.3 · h-index: 25 · Has Code

Predicted impact: top 63% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

Provides a practical benchmark for practitioners choosing between proprietary and open-source multilingual dense retrievers, with concrete latency-accuracy trade-offs.

Google Embeddings 2 (GE2) achieves the top retrieval accuracy (BEIR average nDCG@10 = 0.638) but is roughly 14x slower than the fastest local models. Multilingual-E5-large comes within 0.003 nDCG of GE2 on the Italian RAG benchmark at 31 ms latency, while LaBSE performs poorly (0.188 average nDCG@10 on BEIR).

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with a 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model, including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
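For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how nDCG@10 and median per-query latency are commonly computed. It is not the authors' evaluation code: the function names, the linear-gain form of DCG, and the `encode` callable are all assumptions made for illustration.

```python
import math
import statistics
import time


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades (ranked order).

    Assumption: linear gain (rel / log2(rank + 1), ranks starting at 1).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for a single query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: dict mapping document id -> graded relevance; missing ids count as 0.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def median_latency_ms(encode, queries):
    """Median wall-clock time, in milliseconds, of encode(query) over the queries.

    encode is a hypothetical callable standing in for any embedding model;
    queries must be non-empty.
    """
    timings = []
    for query in queries:
        start = time.perf_counter()
        encode(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Toy example: two relevant documents, only one retrieved (at rank 2).
example_qrels = {"d1": 1, "d2": 1}
example_score = ndcg_at_k(["d3", "d1", "d7"], example_qrels)
```

A benchmark-level figure such as "BEIR avg. nDCG@10" would then be the mean of per-query scores within each dataset, averaged across datasets. Whether the paper averages in exactly this way is not stated in the abstract.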
