AIDB Sep 28, 2025

LLM/Agent-as-Data-Analyst: A Survey

arXiv:2509.23988v3 | 19 citations | h-index: 30
Originality: Synthesis-oriented
AI Analysis

This is an incremental survey that synthesizes existing research on LLM/agent-based data analysis for researchers and practitioners.

This survey reviews how large language models (LLMs) and agent techniques support data analysis by enabling complex data understanding, natural language interfaces, and autonomous pipeline orchestration, and it notes their substantial impact across academia and industry.

Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks (a.k.a. LLM/Agent-as-Data-Analyst), demonstrating substantial impact across both academia and industry. In comparison with traditional rule-based or small-model-based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. From a modality perspective, we review LLM-based techniques for (i) structured data (e.g., NL2SQL, NL2GQL, ModelQA), (ii) semi-structured data (e.g., markup language understanding, semi-structured table question answering), (iii) unstructured data (e.g., chart understanding, text/image document understanding), and (iv) heterogeneous data (e.g., data retrieval and modality alignment in data lakes). The technical evolution further distills four key design goals for intelligent data analysis agents, namely semantic-aware design, autonomous pipelines, tool-augmented workflows, and support for open-world tasks. Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent-powered data analysis.

