CL · May 30, 2025

Semi-structured LLM Reasoners Can Be Rigorously Audited

CMU
arXiv:2505.24217v2 · 5 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the faithfulness of LLM reasoning for users who need reliable, transparent AI outputs. It is an incremental advance that adds auditability to existing models.

The paper tackles the detection of reasoning errors and biases in Large Language Models. It introduces Semi-Structured Reasoning Models (SSRMs), whose semi-structured reasoning traces can be audited automatically, and reports that these audits flag probable errors without compromising accuracy across twelve benchmarks and two model families.

Although Large Language Models (LLMs) have become capable reasoners, the problem of faithfulness persists: their reasoning can contain errors and omissions that are difficult to detect and that may obscure biases in model outputs. To address this issue, we introduce Semi-Structured Reasoning Models (SSRMs), which are trained to produce semi-structured representations of reasoning. SSRMs generate reasoning traces in a non-executable Pythonic syntax that names each reasoning step and marks its inputs and outputs. This structure allows SSRM traces to be automatically audited to identify reasoning flaws. We evaluate three types of audits: hand-crafted structured reasoning audits, written in a domain-specific language (DSL) implemented in Python; LLM-generated structured reasoning audits; and learned typicality audits, which apply probabilistic models over reasoning traces. We show that all of these methods can be used to effectively flag probable reasoning errors. Importantly, the auditability of SSRMs does not appear to compromise overall accuracy: in evaluation on twelve benchmarks and two model families, SSRMs demonstrate strong performance and generalizability relative to other models of comparable size.
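To make the idea of a structured reasoning audit concrete, here is a minimal Python sketch. It is not the paper's trace syntax or its audit DSL: the one-step-per-line format `output = step_name(inputs)`, the step names, and the helpers `parse_trace` and `audit_trace` are all hypothetical, chosen only to show how named steps with marked inputs and outputs make a trace mechanically checkable. The audit flags any step that consumes a value no earlier step produced, and any intermediate output that is never used.

```python
import re
from dataclasses import dataclass

# Hypothetical trace format (an assumption, not the paper's actual syntax):
#   output_name = step_name(input_a, input_b)
STEP_RE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\((.*)\)\s*$")


@dataclass
class Step:
    name: str      # name of the reasoning step
    inputs: list   # names of the values the step consumes
    output: str    # name of the value the step produces


def parse_trace(trace: str) -> list:
    """Parse a trace into Step records, skipping lines that are not steps."""
    steps = []
    for line in trace.strip().splitlines():
        match = STEP_RE.match(line)
        if match is None:
            continue
        output, name, args = match.groups()
        inputs = [a.strip() for a in args.split(",") if a.strip()]
        steps.append(Step(name, inputs, output))
    return steps


def audit_trace(steps: list, given=("question",)) -> list:
    """Return human-readable flags for structurally suspicious steps."""
    flags = []
    defined = set(given)  # values available before any step runs
    used = set()
    for i, step in enumerate(steps):
        for inp in step.inputs:
            if inp not in defined:
                flags.append(
                    f"step {i} ({step.name}): input '{inp}' was never produced"
                )
            used.add(inp)
        defined.add(step.output)
    # Every output except the final answer should feed a later step.
    for i, step in enumerate(steps[:-1]):
        if step.output not in used:
            flags.append(
                f"step {i} ({step.name}): output '{step.output}' is never used"
            )
    return flags


good_trace = """
facts = extract_facts(question)
options = parse_options(question)
answer = choose_option(facts, options)
"""

bad_trace = """
facts = extract_facts(question)
options = parse_options(question)
answer = choose_option(options, hint)
"""

good_flags = audit_trace(parse_trace(good_trace))
bad_flags = audit_trace(parse_trace(bad_trace))
```

In this sketch `good_flags` is empty, while `bad_flags` reports two problems: the final step relies on `hint`, which no step produced, and the extracted `facts` are ignored. A hand-crafted audit in the abstract's sense would presumably encode task-specific expectations of this kind. The learned typicality audits are a different technique (probabilistic models over traces) and are not illustrated here.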

