OckBench: Measuring the Efficiency of LLM Reasoning
This work addresses the need for efficiency-aware evaluation of AI systems, which matters most to developers and researchers who deploy LLMs. The contribution is incremental: it extends existing benchmarking practice with a new metric.
Existing benchmarks for large language models ignore decoding token efficiency, which affects latency, cost, and energy. The authors introduce OckBench, which evaluates both accuracy and token count on reasoning and coding tasks, and show that models with similar accuracy can vary widely in token consumption.
Large language models such as GPT-4, Claude 3, and the Gemini series have improved automated reasoning and code generation. However, existing benchmarks focus mainly on accuracy and output quality and ignore an important factor: decoding token efficiency. In real systems, generating 10,000 tokens versus 100,000 tokens leads to large differences in latency, cost, and energy. In this work, we introduce OckBench, a model-agnostic and hardware-agnostic benchmark that evaluates both accuracy and token count for reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we find that many models with comparable accuracy differ widely in token consumption, revealing that efficiency variance is a neglected but significant axis of differentiation. We further demonstrate Pareto frontiers over the accuracy-efficiency plane and argue for an evaluation paradigm shift: we should no longer treat tokens as "free" to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning. Our benchmarks are available at https://ockbench.github.io/.
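To make the notion of a Pareto frontier over the accuracy-efficiency plane concrete, the following is a minimal sketch, not the OckBench implementation. It assumes each model is summarized by an accuracy score and an average decoding token count per task. The `Result` type, the `pareto_frontier` function, the model names, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """Hypothetical per-model summary (not an OckBench data structure)."""
    name: str
    accuracy: float  # fraction of tasks solved, in [0, 1]
    tokens: float    # average decoding tokens per task


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results that no other result dominates.

    A dominates B when A is at least as accurate as B, uses at most as
    many tokens as B, and is strictly better on at least one of the two.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.tokens <= r.tokens
            and (o.accuracy > r.accuracy or o.tokens < r.tokens)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Order the frontier from cheapest to most expensive in tokens.
    return sorted(frontier, key=lambda r: r.tokens)


# Made-up numbers for illustration only; these are not measured results.
results = [
    Result("model_a", 0.80, 12_000),
    Result("model_b", 0.81, 45_000),
    Result("model_c", 0.70, 3_000),
    Result("model_d", 0.69, 20_000),
]
frontier_names = [r.name for r in pareto_frontier(results)]
```

In this toy data, `model_d` is dominated by `model_a` (lower accuracy and more tokens), so it falls off the frontier. `model_a` and `model_b` both stay on it even though their accuracies are nearly equal, which is the kind of gap in token consumption that an accuracy-only leaderboard would hide.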