LG, AI · Apr 11, 2025

SortBench: Benchmarking LLMs based on their ability to sort lists

arXiv:2504.08312v111.44 citations · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses a specific weakness of LLMs: sorting. The contribution is incremental, since it benchmarks existing models without proposing new methods.

The authors evaluate how well LLMs sort lists by introducing SortBench, a benchmark with several difficulty levels that can be scaled further, and by testing seven state-of-the-art models on it. Even a top model like o3-mini struggles when a task mixes syntax and semantics, and all models lose faithfulness to the input on long lists. Test-time reasoning can also degrade performance.

Sorting is a tedious but simple task for human intelligence and can be solved fairly easily algorithmically. However, for Large Language Models (LLMs) this task is surprisingly hard, as several properties of sorting are among the known weaknesses of LLMs: being faithful to the input data, making logical comparisons between values, and strictly differentiating between syntax (used for sorting) and semantics (typically learned by embeddings). In this paper, we describe the new SortBench benchmark for LLMs, which comes with different difficulty levels and can easily be scaled in difficulty. We apply this benchmark to seven state-of-the-art LLMs, including current test-time reasoning models. Our results show that while the o3-mini model is very capable at sorting in general, even this model can be fooled if strings are defined to mix syntactical and semantical aspects, e.g., by asking it to sort numbers written out as words. Furthermore, all models have problems with faithfulness to the input on long lists, i.e., they drop items and add new ones. Our results also show that test-time reasoning has a tendency to overthink problems, which leads to performance degradation. Finally, models without test-time reasoning, like GPT-4o, are not much worse than reasoning models.
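To make the two failure modes in the abstract concrete, here is a minimal Python sketch of how a sorted-list response might be checked. It is not the scoring used by SortBench; the function names `faithfulness` and `unordered_pair_fraction` and the sample lists are hypothetical, chosen only for illustration. Faithfulness is treated as a multiset comparison (which items were dropped, which were added), and sortedness as the fraction of adjacent pairs that are out of order.

```python
from collections import Counter


def faithfulness(original, output):
    """Compare two lists as multisets.

    Returns (missing, added): items the output dropped and items it invented.
    Both are Counters, so duplicates are tracked correctly.
    """
    orig_counts = Counter(original)
    out_counts = Counter(output)
    missing = orig_counts - out_counts  # in the input but not in the output
    added = out_counts - orig_counts    # in the output but not in the input
    return missing, added


def unordered_pair_fraction(output, key=None):
    """Fraction of adjacent pairs that violate ascending order."""
    if len(output) < 2:
        return 0.0
    keys = [key(x) for x in output] if key else list(output)
    bad = sum(1 for a, b in zip(keys, keys[1:]) if a > b)
    return bad / (len(keys) - 1)


# Hypothetical input list and hypothetical model response.
original = ["pear", "apple", "fig", "kiwi", "apple"]
model_output = ["apple", "fig", "kiwi", "plum", "pear"]

missing, added = faithfulness(original, model_output)
print(missing)  # one "apple" was dropped
print(added)    # "plum" was invented
print(unordered_pair_fraction(model_output))  # 1 of 4 adjacent pairs is out of order

# Syntax versus semantics: written-out numbers sorted as plain strings
# follow the alphabet, not the numeric values the words denote.
print(sorted(["three", "ten", "two"]))  # ['ten', 'three', 'two']
```

Separating the two checks reflects the abstract's distinction: a response can be perfectly ordered yet unfaithful to the input, or faithful yet badly ordered.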

