LG AI SE | Nov 7, 2024

Benchmarking Large Language Models with Integer Sequence Generation Tasks

Daniel O'Malley, Manish Bhattarai, Nishath Rajiv Ranasinghe, Erick Draayer, Javier Santos

arXiv:2411.04372v3 | 2.6 | h-index: 4

Originality: Incremental advance

AI Analysis

This paper provides a rigorous evaluation framework for LLMs in mathematical reasoning, though the advance is incremental: it builds on existing benchmarking approaches and contributes a new dataset.

The authors evaluate large language models' mathematical reasoning and code synthesis capabilities with a benchmark of 1,000 integer sequence generation tasks drawn from OEIS. Reasoning-specialized models such as OpenAI's o-series and Google's Gemini 2.5-pro achieved substantially higher accuracy than non-reasoning models, but overall performance on hard sequences remained poor.

We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the On-Line Encyclopedia of Integer Sequences (OEIS), testing LLMs' abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1,000 OEIS sequences categorized as "easy" or "hard." Half of these sequences are classical sequences from the early days of OEIS, and half were recently added to avoid contamination with the models' training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.
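
The abstract does not say how the cheating detector works. As a rough illustration of what "flagging usage of lookup tables" could mean in practice, the Python sketch below applies a simple static heuristic: it parses a submitted solution and flags any list, tuple, set, or dict literal that hard-codes many integers. The function name looks_like_lookup_table, the min_literals threshold, and the two sample programs are hypothetical and are not taken from the paper; the authors' mechanism may work quite differently.

```python
import ast


def _is_int_literal(node: ast.AST) -> bool:
    """True for a plain integer literal, including a negated one such as -3."""
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        node = node.operand
    # type(...) is int excludes bool, which is a subclass of int.
    return isinstance(node, ast.Constant) and type(node.value) is int


def looks_like_lookup_table(source: str, min_literals: int = 10) -> bool:
    """Heuristic (an assumption, not the paper's method): flag code that
    contains a container literal with at least `min_literals` integers."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.List, ast.Tuple, ast.Set)):
            elements = node.elts
        elif isinstance(node, ast.Dict):
            elements = node.values
        else:
            continue
        if sum(_is_int_literal(e) for e in elements) >= min_literals:
            return True
    return False


# Hypothetical submissions for the Fibonacci numbers.
ALGORITHMIC = """
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
"""

LOOKUP = """
def fib(n):
    table = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    return table[n]
"""

if __name__ == "__main__":
    print(looks_like_lookup_table(ALGORITHMIC))
    print(looks_like_lookup_table(LOOKUP))
```

A purely syntactic check like this is easy to evade (for example, by encoding the values in a string), which is one reason a benchmark would want to validate its detector against human expert judgments, as the abstract describes.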
