NI AI LG, Jun 3, 2025

NetPress: Dynamically Generated LLM Benchmarks for Network Applications

Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu

arXiv:2506.03231v14.34 citations, h-index: 10, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more realistic and scalable evaluation of LLM agents in high-stakes network operations. It is an incremental advance: it builds on existing benchmarking concepts and adds domain-specific automation.

The authors address the limitations of static benchmarks for evaluating LLM agents in network applications by developing NetPress, an automated framework that dynamically generates millions of queries with ground truths and integrates with network emulators for realistic testing. They demonstrate NetPress on three applications, revealing fine-grained behavioral differences that static benchmarks miss.

Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
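The state-and-action abstraction described in the abstract can be illustrated with a small sketch. This is not NetPress's actual API: the `Action` class, the `apply_action`, `generate_query`, `generate_benchmark`, and `is_correct` helpers, and the toy link-set network model are all hypothetical names chosen for illustration. The idea shown is only that, once a task is expressed as an action applied to a state, both the query text and its ground truth can be generated programmatically in any quantity.

```python
import random
from dataclasses import dataclass

# Hypothetical toy model: the network state is a set of undirected links,
# each link being a frozenset of two node names. Not NetPress's real model.
State = frozenset


@dataclass(frozen=True)
class Action:
    kind: str  # "add_link" or "remove_link" (assumed action vocabulary)
    a: str
    b: str


def apply_action(state: State, action: Action) -> State:
    """Return the state that results from applying the action."""
    link = frozenset((action.a, action.b))
    if action.kind == "add_link":
        return state | {link}
    if action.kind == "remove_link":
        return state - {link}
    raise ValueError(f"unknown action kind: {action.kind}")


def generate_query(nodes, state: State, rng: random.Random):
    """Sample one action and derive the query text and ground-truth state."""
    a, b = rng.sample(nodes, 2)
    link = frozenset((a, b))
    kind = "remove_link" if link in state else "add_link"
    action = Action(kind, a, b)
    verb = "Remove the link between" if kind == "remove_link" else "Add a link between"
    text = f"{verb} {a} and {b}."
    return text, action, apply_action(state, action)


def generate_benchmark(nodes, initial_state: State, n: int, seed: int = 0):
    """Generate n (query, action, ground_truth) triples; seeded for repeatability."""
    rng = random.Random(seed)
    return [generate_query(nodes, initial_state, rng) for _ in range(n)]


def is_correct(agent_state: State, ground_truth: State) -> bool:
    """Correctness check: the agent's resulting state must match the ground truth."""
    return agent_state == ground_truth
```

In this sketch the benchmark size is just the argument `n`, which mirrors the abstract's point that queries are generated on the fly from a configuration rather than drawn from a fixed dataset. Safety and latency checks, which the abstract attributes to emulator feedback, are outside the scope of this toy model.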
