Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
This paper addresses the need for benchmarks that assess autonomous goal-setting by LLMs in data science applications, though its contribution is incremental because it focuses on a single domain.
The paper tackles the problem of evaluating investigatory intelligence in large language models (LLMs) by introducing Deep Data Research (DDR), an open-ended task in which models autonomously extract insights from databases, and DDR-Bench, a checklist-based benchmark for verifiable evaluation. Results show that frontier models exhibit emerging agency but struggle with long-horizon exploration, and that effectiveness depends on the models' intrinsic strategies, not only on scaling or agent scaffolding.
The agency expected of agentic large language models (LLMs) goes beyond answering correctly: it requires the autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or scale, but also on the intrinsic strategies of agentic models.
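The abstract does not spell out how DDR-Bench scores a run, so the following is only a minimal sketch of what checklist-based evaluation could look like in general, not the benchmark's actual procedure. It assumes each checklist item is a set of keywords that must all appear in a single reported insight, and it reports the fraction of items covered. The names `ChecklistItem`, `item_covered`, and `checklist_score`, the keyword-matching rule, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ChecklistItem:
    """One verifiable fact an agent is expected to surface (hypothetical format)."""
    item_id: str
    keywords: Tuple[str, ...]  # assumption: all keywords must occur in one insight


def item_covered(item: ChecklistItem, insights: Sequence[str]) -> bool:
    """Return True if any single insight mentions every keyword of the item."""
    for insight in insights:
        text = insight.lower()
        if all(keyword.lower() in text for keyword in item.keywords):
            return True
    return False


def checklist_score(checklist: Sequence[ChecklistItem], insights: Sequence[str]) -> float:
    """Fraction of checklist items covered by the agent's insights (0.0 if empty)."""
    if not checklist:
        return 0.0
    covered = sum(item_covered(item, insights) for item in checklist)
    return covered / len(checklist)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    checklist = [
        ChecklistItem("c1", ("readmission", "diabetes")),
        ChecklistItem("c2", ("revenue", "q3")),
    ]
    insights = [
        "Diabetes patients show higher readmission rates.",
        "Average stay is 4 days.",
    ]
    print(checklist_score(checklist, insights))
```

Keyword matching is used here only because it is deterministic and easy to check; a real benchmark might instead rely on a stricter or model-based judge to decide whether an insight satisfies an item.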