CL, AI, IR, LG, MA | Jun 24, 2025

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

arXiv:2506.19794v5 | 5 citations | h-index: 32 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the limitations of open-source LLMs in data analysis, an incremental advance for users who rely on these models for automated reasoning tasks.

The paper investigates why open-source LLMs struggle with data analysis tasks and which strategies enhance their capabilities. The resulting data synthesis methodology yields significant improvements in analytical reasoning.

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.
