CR, AI | Feb 8, 2025

CryptoX: Compositional Reasoning Evaluation of Large Language Models

arXiv:2502.07813v2 | 4 citations | h-index: 13 | Has Code
Originality: Highly original
AI Analysis

This work addresses compositional reasoning in large language models, a capacity regarded as critical to their generalization and to the emergence of intelligence. It is most relevant to developers and users of language models.

The authors evaluate the compositional reasoning capacity of large language models with their evaluation framework, CryptoX. They find that closed-source models significantly outperform open-source models, and they conclude that compositional reasoning capabilities need to be enhanced.

Compositional reasoning capacity has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, which involves subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of studying compositional reasoning independently and emphasize the need to enhance the compositional reasoning capabilities of LLMs.
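To make the idea of combining a benchmark question with a cryptographic rule concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which encoding rules CryptoX uses, so a Caesar shift stands in for them, and the names `caesar_shift`, `encode_question`, and `build_prompt` are hypothetical. The point is that the model must first decode the altered words (one subproblem) and then answer the original question (another), which is the kind of composition the abstract describes.

```python
import random


def caesar_shift(word: str, shift: int) -> str:
    """Shift each ASCII letter by `shift` positions, preserving case.

    Assumption: a Caesar cipher is a stand-in for whatever encoding
    rules the CryptoX framework actually applies.
    """
    out = []
    for ch in word:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)  # digits and punctuation are left unchanged
    return "".join(out)


def encode_question(question: str, num_words: int, shift: int = 3, seed: int = 0):
    """Encode `num_words` randomly chosen words of a benchmark question.

    Returns the encoded question and the sorted indices of the encoded words.
    """
    words = question.split()
    rng = random.Random(seed)  # seeded so the same item encodes the same way
    candidates = [i for i, w in enumerate(words) if any(c.isalpha() for c in w)]
    chosen = set(rng.sample(candidates, min(num_words, len(candidates))))
    encoded = [caesar_shift(w, shift) if i in chosen else w
               for i, w in enumerate(words)]
    return " ".join(encoded), sorted(chosen)


def build_prompt(question: str, num_words: int, shift: int = 3, seed: int = 0) -> str:
    """Wrap the encoded question with a plain-language statement of the rule."""
    encoded, _ = encode_question(question, num_words, shift, seed)
    rule = (
        f"Some words below were encoded by shifting each letter {shift} "
        "places forward in the alphabet. Decode them, then answer the question."
    )
    return f"{rule}\n\nQuestion: {encoded}"


if __name__ == "__main__":
    print(build_prompt("What is the capital of France?", num_words=2))
```

In this sketch, raising `num_words` increases how much decoding must be composed with the original task, so comparing accuracy at `num_words=0` against larger values would give one rough measure of the compositional burden. Whether CryptoBench varies difficulty in this way is an assumption here.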

Code Implementations: 1 repo
