Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

arXiv:2605.3140838.4

AI Analysis

For developers and researchers building large language model agents, this study indicates that, in this controlled setting, whether skills are provided at all matters more for performance than the granularity at which they are presented.

This study investigates how skill availability and presentation granularity affect the task success of large language model agents. Compared with no skills, providing skills raised task-mean pass rates by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. Changes in skill presentation granularity, by contrast, had small, uncertain, and model-dependent effects.

Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.
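The aggregation and paired-contrast procedure described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code. The abstract says only that five trials are aggregated within each cell and that 95% bootstrap confidence intervals are reported for contrasts over 30 tasks. The percentile bootstrap that resamples tasks, the function names `cell_pass_rates` and `paired_bootstrap`, the condition labels, and the synthetic rows are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean

# Design size stated in the abstract: tasks x conditions x trials x models.
N_TASKS, N_CONDITIONS, N_TRIALS, N_MODELS = 30, 6, 5, 2
assert N_TASKS * N_CONDITIONS * N_TRIALS * N_MODELS == 1800


def cell_pass_rates(rows):
    """Aggregate trial rows (task, condition, passed) for one model
    into a single pass rate per (task, condition) cell."""
    cells = defaultdict(list)
    for task, condition, passed in rows:
        cells[(task, condition)].append(1.0 if passed else 0.0)
    return {key: mean(values) for key, values in cells.items()}


def paired_bootstrap(rates, cond_a, cond_b, n_boot=2000, seed=0):
    """Return the mean paired difference (cond_a - cond_b) over tasks and
    a 95% percentile bootstrap interval. Tasks are resampled with
    replacement (an assumed scheme; the abstract does not specify one)."""
    tasks = sorted({task for task, _ in rates})
    diffs = [rates[(t, cond_a)] - rates[(t, cond_b)] for t in tasks]
    rng = random.Random(seed)
    boot = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boot[int(0.025 * n_boot)]
    upper = boot[int(0.975 * n_boot) - 1]
    return mean(diffs), (lower, upper)


if __name__ == "__main__":
    # Synthetic data only: the pass probabilities are made up for illustration.
    rng = random.Random(1)
    rows = [
        (task, cond, rng.random() < p)
        for task in range(N_TASKS)
        for cond, p in (("no_skill", 0.3), ("skill", 0.6))
        for _ in range(N_TRIALS)
    ]
    diff, (lo, hi) = paired_bootstrap(cell_pass_rates(rows), "skill", "no_skill")
    print(f"difference: {100 * diff:+.1f} pp, "
          f"95% CI [{100 * lo:+.1f}, {100 * hi:+.1f}]")
```

Aggregating trials before the contrast keeps the task as the unit of inference, so the five trials within a cell are not treated as independent observations. An interval that includes zero, as reported for the presentation-granularity contrasts, means the sign of the effect is not resolved at this sample size.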
