IR | CL | Oct 2, 2025

Study on LLMs for Promptagator-Style Dense Retriever Training

arXiv:2510.02241v1 | 1 citations | h-index: 2 | Has Code | CIKM
Originality: Synthesis-oriented
AI Analysis

This work provides an incremental solution for practitioners who need accessible, privacy-compliant alternatives for synthetic data generation in domain-specific retrieval tasks.

The study tackles the dependency on proprietary LLMs in Promptagator-style dense retriever training. It shows that open-source LLMs as small as 3B parameters can generate effective queries for fine-tuning, achieving results comparable to those of larger models.

Promptagator demonstrated that Large Language Models (LLMs) with few-shot prompts can be used as task-specific query generators for fine-tuning domain-specialized dense retrieval models. However, the original Promptagator approach relied on proprietary and large-scale LLMs which users may not have access to or may be prohibited from using with sensitive data. In this work, we study the impact of open-source LLMs at accessible scales ($\leq$14B parameters) as an alternative. Our results demonstrate that open-source LLMs as small as 3B parameters can serve as effective Promptagator-style query generators. We hope our work will inform practitioners with reliable alternatives for synthetic data generation and give insights to maximize fine-tuning results for domain-specific applications.
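The query-generation step described in the abstract can be illustrated with a minimal sketch. It assumes a simple "Document / Query" few-shot template and a caller-supplied `generate` function that wraps whichever open-source LLM is in use. The template wording, the function names, and the stub generator are all hypothetical and are not taken from the paper or from Promptagator.

```python
from typing import Callable, List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], document: str) -> str:
    """Format (document, query) examples, then the target document with an open query slot.

    The "Document:/Query:" layout is an assumed template, not the paper's.
    """
    blocks = [f"Document: {doc}\nQuery: {query}" for doc, query in examples]
    blocks.append(f"Document: {document}\nQuery:")
    return "\n\n".join(blocks)


def generate_training_pairs(
    corpus: List[str],
    examples: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Produce synthetic (query, document) pairs for fine-tuning a dense retriever."""
    pairs = []
    for document in corpus:
        prompt = build_few_shot_prompt(examples, document)
        query = generate(prompt).strip()
        if query:  # skip documents for which the generator returned nothing
            pairs.append((query, document))
    return pairs


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; builds a query from the last document's first word."""
    last_doc = prompt.rsplit("Document: ", 1)[1].split("\nQuery:")[0]
    return "what is " + last_doc.split()[0].lower() + "?"
```

In practice, `stub_generate` would be replaced by a call to a real model, and the resulting pairs would typically be filtered before being used as positive examples for retriever fine-tuning.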

