CL | AI | Jan 13

BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

arXiv:2601.08490v11.63 citations | h-index: 8 | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

The paper addresses economic and environmental concerns for LLM operators by providing a benchmark for measuring and mitigating overflow. Its contribution is incremental, since it benchmarks existing models rather than proposing new ones.

The paper tackles the problem of large language models (LLMs) producing excessive output in response to plain-text prompts, termed Overflow, which increases serving cost and environmental impact. The authors introduce BenchOverflow, a benchmark of nine prompting strategies, and show that a lightweight mitigation (a fixed conciseness reminder) lowers overflow risk for all strategies across the majority of the nine models evaluated.

We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation, a fixed conciseness reminder, attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.
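The abstract names cap-saturation rates and ECDFs as its tail-risk metrics but does not define them here. The sketch below is one plausible reading, not the authors' implementation: it assumes CSR@k is the fraction of responses whose generated length reaches at least k tokens, and it uses the standard ECDF definition. The function names and the sample token counts are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def cap_saturation_rate(lengths, cap):
    """Fraction of responses with at least `cap` generated tokens.

    ASSUMPTION: this is an illustrative definition of CSR@cap; the
    abstract does not spell out the exact formula.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    return sum(1 for n in lengths if n >= cap) / len(lengths)


def ecdf(lengths):
    """Return F, where F(x) is the fraction of lengths that are <= x."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths)
    total = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / total

    return F


# Hypothetical output lengths (in tokens) under a 5000-token budget.
lengths = [120, 450, 980, 1000, 2600, 3000, 4999, 5000, 5000, 5000]

csr = {cap: cap_saturation_rate(lengths, cap) for cap in (1000, 3000, 5000)}
F = ecdf(lengths)
```

Under this reading, comparing `csr` with and without a conciseness reminder would show whether the mitigation lowers saturation at each cap, while `F` exposes the full shape of the length distribution, including the right tail.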
