CL · Mar 18, 2024

Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models

arXiv:2403.11802v5 · 16.4 · 29 citations · h-index: 7 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the problem of evaluating long-context capabilities in LLMs, which matters to researchers and developers, though the advance is incremental because it builds on existing benchmark efforts.

The paper tackles the lack of benchmarks for evaluating long-context large language models by introducing Counting-Stars, a multi-evidence, position-aware, and scalable benchmark, and finds that Gemini 1.5 Pro achieves the best overall results while GPT-4 Turbo shows the most stable performance across tasks.

Despite recent efforts to develop large language models with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To close this gap, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multi-evidence retrieval sub-tasks: searching and reasoning. Using Counting-Stars, we conduct experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude 3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the length of the input context and the complexity of the tasks increase.

