LG | CL | Oct 17, 2025

Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential

arXiv:2510.15216v2 | h-index: 14
Originality: Incremental advance
AI Analysis

The paper offers a practical metric for selecting or designing stronger base models for reasoning tasks. The advance is incremental, since it builds on existing methods such as cross-layer sparse autoencoders and reinforcement learning with verifiable rewards.

The paper tackles the problem of predicting a large language model's reasoning potential after reinforcement learning with verifiable rewards (RLVR). It identifies a microscopic property, the Soundness-Aware Level (SAL), which measures how well a model distinguishes sound from unsound internal rules, and reports an R^2 of 0.87 when predicting post-RLVR performance across diverse models and scales.

Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), yet performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses ("if-then" rules) built from features extracted from the LLM's latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between these features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules' soundness levels, becoming highly distinct for "strict" versus "noisy" rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL's predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model's reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound knowledge. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model's internal mechanisms for selecting/designing stronger base models.
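
To make the idea of "separation between distributions" concrete, the following is a minimal sketch of a Jensen-Shannon Divergence computed between two histograms of rule transition probabilities, one for "strict" rules and one for "noisy" rules. It is an illustration only: the abstract does not specify how SAL bins the probabilities or aggregates across soundness levels, so the helper names (`histogram`, `kl_bits`, `jsd_bits`), the 10-bin choice, and the toy probability lists are all assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def histogram(probs: Sequence[float], bins: int = 10) -> list[float]:
    """Bin probabilities in [0, 1] into a normalized histogram (assumed binning)."""
    counts = [0] * bins
    for p in probs:
        idx = min(int(p * bins), bins - 1)  # p == 1.0 falls in the last bin
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon Divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)


# Hypothetical transition probabilities, invented for illustration.
strict = [0.9, 0.85, 0.95, 0.8]   # a model that trusts strict rules
noisy = [0.1, 0.2, 0.15, 0.05]    # ... and distrusts noisy ones

separation = jsd_bits(histogram(strict), histogram(noisy))
collapsed = jsd_bits(histogram(strict), histogram(strict))
```

In this toy setup, a soundness-aware model corresponds to a `separation` near the upper bound, because the two histograms barely overlap, while a soundness-agnostic model whose distributions collapse onto each other yields a value near zero, as `collapsed` does.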

