CL, AI, LG | Sep 29, 2025

Your thoughts tell who you are: Characterize the reasoning patterns of LRMs

Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, Shaoliang Nie

Harvard

arXiv:2509.24147v1 | 9.63 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

The paper gives researchers and practitioners a tool for characterizing and comparing reasoning patterns in AI models, and reports incremental improvements in model performance.

The paper addresses the question of how large reasoning models (LRMs) reason differently by introducing the LLM-proposed Open Taxonomy (LOT) method. LOT achieves 80-100% accuracy in distinguishing reasoning traces from different LRMs, and aligning the reasoning style of smaller models with that of larger ones improves the smaller models' accuracy on GPQA by 3.3-5.7%.

Current comparisons of large reasoning models (LRMs) focus on macro-level statistics such as task accuracy or reasoning length. Whether different LRMs reason differently remains an open question. To address this gap, we introduce the LLM-proposed Open Taxonomy (LOT), a classification method that uses a generative language model to compare reasoning traces from two LRMs and articulate their distinctive features in words. LOT then models how these features predict the source LRM of a reasoning trace based on their empirical distributions across LRM outputs. Iterating this process over a dataset of reasoning traces yields a human-readable taxonomy that characterizes how models think. We apply LOT to compare the reasoning of 12 open-source LRMs on tasks in math, science, and coding. LOT identifies systematic differences in their thoughts, achieving 80-100% accuracy in distinguishing reasoning traces from LRMs that differ in scale, base model family, or objective domain. Beyond classification, LOT's natural-language taxonomy provides qualitative explanations of how LRMs think differently. Finally, in a case study, we link the reasoning differences to performance: aligning the reasoning style of smaller Qwen3 models with that of the largest Qwen3 during test time improves their accuracy on GPQA by 3.3-5.7%.
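
The abstract's classification step, in which annotated features predict the source LRM from their empirical distributions, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not specify the classifier, so a Bernoulli naive Bayes with Laplace smoothing stands in for it, and the feature names, model labels, and toy data are hypothetical. The LLM step that proposes and annotates features is omitted entirely.

```python
import math
from collections import defaultdict


def fit_feature_rates(traces, features, alpha=1.0):
    """Estimate P(feature present | source model) with Laplace smoothing.

    traces: list of (source_label, set_of_features_present) pairs.
    features: list of feature names in the taxonomy (hypothetical here).
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for label, present in traces:
        totals[label] += 1
        for f in present:
            counts[label][f] += 1
    rates = {
        label: {
            f: (counts[label][f] + alpha) / (totals[label] + 2 * alpha)
            for f in features
        }
        for label in totals
    }
    priors = {label: totals[label] / len(traces) for label in totals}
    return rates, priors


def predict_source(present, rates, priors):
    """Return the source label with the highest log-posterior score."""
    scores = {}
    for label, feature_rates in rates.items():
        score = math.log(priors[label])
        for f, p in feature_rates.items():
            score += math.log(p if f in present else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical taxonomy features and toy annotations (not from the paper).
FEATURES = ["verifies_answer", "enumerates_cases", "restates_problem"]
train = [
    ("model_a", {"verifies_answer", "restates_problem"}),
    ("model_a", {"verifies_answer"}),
    ("model_a", {"verifies_answer", "enumerates_cases"}),
    ("model_b", {"enumerates_cases"}),
    ("model_b", {"enumerates_cases", "restates_problem"}),
    ("model_b", set()),
]

rates, priors = fit_feature_rates(train, FEATURES)
print(predict_source({"verifies_answer"}, rates, priors))
print(predict_source({"enumerates_cases"}, rates, priors))
```

In this toy setup the per-model feature rates double as a readable summary of how the two sources differ, which is the kind of human-readable output the abstract attributes to the taxonomy. Whether LOT uses this particular model is not stated in the text above.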
