Benchmarking Data Science Agents
This work addresses the need for better benchmarking tools for data science agents, a need that matters to both researchers and practitioners in AI and data science, although the contribution is incremental in nature.
The paper tackles the challenge of evaluating data science agents by introducing DSEval, a novel evaluation paradigm with accompanying benchmarks that assess performance across the entire data science lifecycle. A bootstrapped annotation method streamlines dataset preparation and improves evaluation coverage.
In the era of data-driven decision-making, the complexity of data analysis demands advanced data science expertise and tools, presenting significant challenges even for specialists. Large Language Models (LLMs) have emerged as promising data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and by complicated analytical processes. In this paper, we introduce DSEval, a novel evaluation paradigm, together with a series of innovative benchmarks tailored for assessing the performance of these agents throughout the entire data science lifecycle. By incorporating a novel bootstrapped annotation method, we streamline dataset preparation, improve evaluation coverage, and expand benchmarking comprehensiveness. Our findings uncover prevalent obstacles and provide critical insights to inform future advancements in the field.