Chenglin Wu

h-index: 30

Papers: 5

Citations: 156

Novelty: 53%

AI Score: 48

Ranked #28,830 of 194,257 authors (top 15%); #5,976 in CL (top 19%)

5 Papers

31.4 | CL | Feb 17, 2025 | Code

Atom of Thoughts for Markov LLM Test-Time Scaling

Fengwei Teng, Zhaoyang Yu, Quan Shi et al.

Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning can be achieved by solving a series of independent and self-contained subquestions. These subquestions are essentially atomic questions, exhibiting the memoryless property similar to Markov processes. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a simplified question that maintains answer equivalence with the original problem. This answer preservation enables the iterative decomposition-contraction process to naturally form a meaningful Markov reasoning process. Furthermore, these atomic states can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code is available at https://github.com/qixucen/atom.
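The decomposition-contraction loop described in this abstract can be pictured with a small toy. The sketch below is not the authors' implementation: it replaces LLM calls with plain Python functions, and the names `step`, `solve`, `Node`, and `State` are hypothetical. It only illustrates the idea that each transition solves the independent subquestions of the current state, folds their answers into the dependent ones as known conditions, and then needs nothing from earlier states.

```python
import math
from typing import Callable, Dict, Tuple

# A toy "question": named subquestions, each with its dependencies and a solver.
# Hypothetical structure for illustration; a real system would call an LLM.
Node = Tuple[Tuple[str, ...], Callable[..., float]]
State = Dict[str, Node]


def step(state: State) -> State:
    """One transition: solve independent nodes, contract them into the rest."""
    # Independent subquestions are the nodes with no dependencies.
    solved = {name: fn() for name, (deps, fn) in state.items() if not deps}
    nxt: State = {}
    for name, (deps, fn) in state.items():
        if not deps:
            continue  # already answered; it is dropped from the next state
        remaining = tuple(d for d in deps if d not in solved)

        # Fold the solved answers in as known conditions of the new question.
        def contracted(*args, _fn=fn, _deps=deps, _solved=solved):
            rest = iter(args)
            return _fn(*[_solved[d] if d in _solved else next(rest) for d in _deps])

        nxt[name] = (remaining, contracted)
    return nxt


def solve(state: State, final: str) -> float:
    """Iterate transitions until the final question has no dependencies left."""
    while state[final][0]:
        state = step(state)  # depends only on the current state
    return state[final][1]()


# Example DAG: d = sqrt(c), c = a^2 + b^2, with a and b independent.
question: State = {
    "a": ((), lambda: 3.0),
    "b": ((), lambda: 4.0),
    "c": (("a", "b"), lambda a, b: a * a + b * b),
    "d": (("c",), lambda c: math.sqrt(c)),
}
answer = solve(question, "d")
```

After the first call to `step`, the state holds only `c` and `d`; the answers to `a` and `b` survive solely as conditions baked into `c`, which is the memoryless property the abstract refers to.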

28.2 | CL | Mar 10, 2025 | Code

MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Xiangru Tang, Daniel Shao, Jiwoong Sohn et al.

Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning, scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at https://github.com/gersteinlab/medagents-benchmark.

27.4 | CL | Feb 7, 2025 | Code

Self-Supervised Prompt Optimization

Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu et al.

Well-designed prompts are crucial for enhancing the reasoning capabilities of large language models (LLMs) while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or human feedback, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external references. Motivated by the observations that prompt quality manifests directly in LLM outputs and that LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
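The selection-by-pairwise-comparison idea in this abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the SPO codebase: `spo_style_loop` and its three callables (`execute`, `optimize`, `judge`) are hypothetical names, the majority-vote acceptance rule is an assumption of this sketch, and the stand-ins below are deterministic string functions where the described framework would use LLM calls.

```python
from typing import Callable, List


def spo_style_loop(
    prompt: str,
    questions: List[str],
    execute: Callable[[str, str], str],      # (prompt, question) -> output
    optimize: Callable[[str, List[str]], str],  # (prompt, outputs) -> new prompt
    judge: Callable[[str, str, str], bool],  # (question, old, new) -> new is better?
    rounds: int = 3,
) -> str:
    """Keep a candidate prompt only if its outputs win most pairwise comparisons."""
    best = prompt
    best_out = [execute(best, q) for q in questions]
    for _ in range(rounds):
        cand = optimize(best, best_out)
        cand_out = [execute(cand, q) for q in questions]
        # No ground truth is consulted: only outputs are compared, pair by pair.
        wins = sum(judge(q, a, b) for q, a, b in zip(questions, best_out, cand_out))
        if wins > len(questions) / 2:
            best, best_out = cand, cand_out
    return best


# Deterministic stand-ins (assumptions for the demo, not real LLM behaviour).
def execute(prompt: str, question: str) -> str:
    return f"[{prompt}] {question}"


def optimize(prompt: str, outputs: List[str]) -> str:
    if "step by step" not in prompt:
        return prompt + " Think step by step."
    return prompt + " Please."


def judge(question: str, old: str, new: str) -> bool:
    return "step by step" in new and "step by step" not in old


result = spo_style_loop("Solve.", ["2+2?", "3*3?"], execute, optimize, judge)
```

With these stand-ins, the first candidate is accepted and later candidates are rejected, so the loop settles on the first improved prompt.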

10.9 | CL | Sep 17, 2025

Improving Context Fidelity via Native Retrieval-Augmented Reasoning

Suyuchen Wang, Jinlin Wang, Xinyu Wang et al.

Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model's own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

5.1 | CE | Nov 28, 2020

Thermodynamic Consistent Neural Networks for Learning Material Interfacial Mechanics

Jiaxin Zhang, Congjie Wei, Chenglin Wu

For multilayer materials in thin substrate systems, interfacial failure is one of the major challenges. The traction-separation relations (TSR) quantitatively describe the mechanical behavior of a material interface undergoing opening, which is critical for understanding and predicting interfacial failures under complex loadings. However, existing theoretical models lack the complexity and flexibility needed to learn real-world TSR from experimental observations. A neural network can fit the data well along the loading paths but often fails to obey the laws of physics, owing to a lack of experimental data and limited understanding of the hidden physical mechanism. In this paper, we propose a thermodynamically consistent neural network (TCNN) approach to build a data-driven model of the TSR from sparse experimental data. The TCNN leverages recent advances in physics-informed neural networks (PINN) that encode prior physical information into the loss function and efficiently train the neural networks using automatic differentiation. We investigate three thermodynamic consistency principles: positive energy dissipation, steepest energy dissipation gradient, and energy-conservative loading path. All of them are mathematically formulated and embedded into a neural network model through a newly defined loss function. A real-world experiment demonstrates the superior performance of TCNN: it provides an accurate prediction of the whole TSR surface and significantly reduces predictions that violate the laws of physics.
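The abstract does not give the loss function itself. As a rough illustration of how a physical constraint such as positive energy dissipation can enter a loss, the sketch below adds a generic soft penalty to a data-fit term. The function `penalized_loss`, the squared-hinge form, and the `weight` parameter are assumptions of this sketch, not the paper's formulation.

```python
from typing import Sequence


def penalized_loss(
    pred: Sequence[float],
    target: Sequence[float],
    dissipation: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Data-fit error plus a penalty on negative (unphysical) dissipation values."""
    # Mean squared error against the measured values.
    data_term = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    # Squared hinge: zero when dissipation >= 0, grows when it goes negative.
    penalty = sum(max(0.0, -d) ** 2 for d in dissipation) / len(dissipation)
    return data_term + weight * penalty


# A perfect data fit is still penalized if one dissipation value is negative.
loss = penalized_loss([1.0, 2.0], [1.0, 2.0], [0.5, -1.0])
```

In a physics-informed setting the dissipation values would themselves be computed from the network's outputs, typically via automatic differentiation, so the penalty steers training away from predictions that violate the constraint.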