IR | AI | CL | Mar 5

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

arXiv:2603.04743v1 | 2 citations | Has Code
AI Analysis

This work aims to improve the ability of LLM agents to access and utilize the mature R statistical ecosystem, benefiting data scientists and researchers who rely on R for rigorous statistical analysis.

This paper addresses the challenge of Large Language Model (LLM) agents underutilizing rigorous statistical methods in R due to difficulties with statistical knowledge and tool retrieval. The authors propose DARE, a retrieval model that incorporates data distribution information into function representations, achieving an NDCG@10 of 93.47% and outperforming state-of-the-art open-source embedding models by up to 17% on R package retrieval.
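For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It assumes the standard linear-gain formulation with a log2 rank discount, and it assumes the relevance list covers the full candidate ranking; the paper's exact evaluation protocol is not described here, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(index + 2)
               for index, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    `relevances` holds the graded relevance of each candidate in the order
    the retriever ranked them (assumption: it covers the whole candidate set,
    so sorting it gives the ideal ranking).
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists for this query
    return dcg_at_k(relevances, k) / ideal_dcg


# One relevant function, retrieved at rank 1 versus rank 2.
perfect = ndcg_at_k([1, 0, 0])
second_place = ndcg_at_k([0, 1, 0])
```

A benchmark score such as the one reported would then be the mean of this per-query value over all test queries.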

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
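To make the idea of distribution-aware retrieval concrete, here is a toy sketch. It is not the DARE architecture, which the abstract does not specify: it simply concatenates a hypothetical text embedding with a hypothetical data-profile vector and ranks by cosine similarity. All vectors, the `dist_weight` parameter, and the two-feature data profile are invented for illustration; `t.test` and `wilcox.test` are real R functions used only as labels.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuse(text_vec, dist_vec, dist_weight=0.5):
    """Concatenate text and distribution features (an assumed fusion rule)."""
    return list(text_vec) + [dist_weight * x for x in dist_vec]


def rank(query, candidates, dist_weight=0.5):
    """Return candidate names sorted by similarity to the query, best first."""
    q = fuse(query["text"], query["dist"], dist_weight)
    scored = [(cosine(q, fuse(c["text"], c["dist"], dist_weight)), c["name"])
              for c in candidates]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical data profile: [approximately normal, heavily skewed].
# Both functions get the same made-up text vector ("compare two groups"),
# so only the data profile can separate them.
candidates = [
    {"name": "t.test", "text": [1.0, 0.0], "dist": [1.0, 0.0]},
    {"name": "wilcox.test", "text": [1.0, 0.0], "dist": [0.0, 1.0]},
]
query = {"text": [1.0, 0.0], "dist": [0.0, 1.0]}  # skewed input data

ranking = rank(query, candidates)
```

In this toy setup a purely semantic retriever could not prefer one function over the other, while the fused representation ranks the function whose assumed data profile matches the query's data first.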

