AISep 2, 2025Code
The Landscape of Agentic Reinforcement Learning for LLMs: A SurveyGuibin Zhang, Hejia Geng, Xiaohang Yu et al.
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
LGJan 20, 2025Code
RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?Haotian Xu, Xing Wu, Weinong Wang et al.
Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs and different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2\% to 81.6\%, and on the USA Math Olympiad (AIME), it solves 46.7\% of problems using only 21k mixed-code-math datasets. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities-even with limited dataset and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.
CVJul 27, 2022
Fault Detection and Classification of Aerospace Sensors using a VGG16-based Deep Neural NetworkZhongzhi Li, Yunmei Zhao, Jinyi Ma et al.
Compared with traditional model-based fault detection and classification (FDC) methods, deep neural networks (DNN) prove to be effective for the aerospace sensors FDC problems. However, time being consumed in training the DNN is excessive, and explainability analysis for the FDC neural network is still underwhelming. A concept known as imagefication-based intelligent FDC has been studied in recent years. This concept advocates to stack the sensors measurement data into an image format, the sensors FDC issue is then transformed to abnormal regions detection problem on the stacked image, which may well borrow the recent advances in the machine vision vision realm. Although promising results have been claimed in the imagefication-based intelligent FDC researches, due to the low size of the stacked image, small convolutional kernels and shallow DNN layers were used, which hinders the FDC performance. In this paper, we first propose a data augmentation method which inflates the stacked image to a larger size (correspondent to the VGG16 net developed in the machine vision realm). The FDC neural network is then trained via fine-tuning the VGG16 directly. To truncate and compress the FDC net size (hence its running time), we perform model pruning on the fine-tuned net. Class activation mapping (CAM) method is also adopted for explainability analysis of the FDC net to verify its internal operations. Via data augmentation, fine-tuning from VGG16, and model pruning, the FDC net developed in this paper claims an FDC accuracy 98.90% across 4 aircraft at 5 flight conditions (running time 26 ms). The CAM results also verify the FDC net w.r.t. its internal operations.
CVJun 18, 2022
Augmented Imagefication: A Data-driven Fault Detection Method for Aircraft Air Data SensorsHang Zhao, Jinyi Ma, Zhongzhi Li et al.
In this paper, a novel data-driven approach named Augmented Imagefication for Fault detection (FD) of aircraft air data sensors (ADS) is proposed. Exemplifying the FD problem of aircraft air data sensors, an online FD scheme on edge device based on deep neural network (DNN) is developed. First, the aircraft inertial reference unit measurements is adopted as equivalent inputs, which is scalable to different aircraft/flight cases. Data associated with 6 different aircraft/flight conditions are collected to provide diversity (scalability) in the training/testing database. Then Augmented Imagefication is proposed for the DNN-based prediction of flying conditions. The raw data are reshaped as a grayscale image for convolutional operation, and the necessity of augmentation is analyzed and pointed out. Different kinds of augmented method, i.e. Flip, Repeat, Tile and their combinations are discussed, the result shows that the All Repeat operation in both axes of image matrix leads to the best performance of DNN. The interpretability of DNN is studied based on Grad-CAM, which provide a better understanding and further solidifies the robustness of DNN. Next the DNN model, VGG-16 with augmented imagefication data is optimized for mobile hardware deployment. After pruning of DNN, a lightweight model (98.79% smaller than original VGG-16) with high accuracy (slightly up by 0.27%) and fast speed (time delay is reduced by 87.54%) is obtained. And the hyperparameters optimization of DNN based on TPE is implemented and the best combination of hyperparameters is determined (learning rate 0.001, iterative epochs 600, and batch size 100 yields the highest accuracy at 0.987). Finally, a online FD deployment based on edge device, Jetson Nano, is developed and the real time monitoring of aircraft is achieved. We believe that this method is instructive for addressing the FD problems in other similar fields.
AIFeb 15, 2024
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-SolvingJiaxin Zhang, Zhongzhi Li, Mingliang Zhang et al.
Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750 problems subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67\% accuracy rate on the main subset but only a 6.00\% accuracy on the hard subset. This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities.
CLAug 19, 2025
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVRXiao Liang, Zhongzhi Li, Yeyun Gong et al.
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
ROMay 22, 2025
ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language ModelsZirui Song, Guangxian Ouyang, Mingzhe Li et al.
Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training datasets, which limits their generalization and causes them to struggle in out-of-domain (OOD) scenarios, reducing real-world adaptability. To address these challenges, we propose ManipLVM-R1, a novel reinforcement learning framework that replaces traditional supervision with Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing for task-aligned outcomes, our method enhances generalization and physical reasoning while removing the dependence on costly annotations. Specifically, we design two rule-based reward functions targeting key robotic manipulation subtasks: an Affordance Perception Reward to enhance localization of interaction regions, and a Trajectory Match Reward to ensure the physical plausibility of action paths. These rewards provide immediate feedback and impose spatial-logical constraints, encouraging the model to go beyond shallow pattern matching and instead learn deeper, more systematic reasoning about physical interactions.
AIMay 17, 2025
LifelongAgentBench: Evaluating LLM Agents as Lifelong LearnersJunhao Zheng, Xidi Cai, Qiuke Li et al.
Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory-capable LLM agents.
LGMar 29, 2024
TDANet: A Novel Temporal Denoise Convolutional Neural Network With Attention for Fault DiagnosisZhongzhi Li, Rong Fan, Jingqi Tu et al.
Fault diagnosis plays a crucial role in maintaining the operational integrity of mechanical systems, preventing significant losses due to unexpected failures. As intelligent manufacturing and data-driven approaches evolve, Deep Learning (DL) has emerged as a pivotal technique in fault diagnosis research, recognized for its ability to autonomously extract complex features. However, the practical application of current fault diagnosis methods is challenged by the complexity of industrial environments. This paper proposed the Temporal Denoise Convolutional Neural Network With Attention (TDANet), designed to improve fault diagnosis performance in noise environments. This model transforms one-dimensional signals into two-dimensional tensors based on their periodic properties, employing multi-scale 2D convolution kernels to extract signal information both within and across periods. This method enables effective identification of signal characteristics that vary over multiple time scales. The TDANet incorporates a Temporal Variable Denoise (TVD) module with residual connections and a Multi-head Attention Fusion (MAF) module, enhancing the saliency of information within noisy data and maintaining effective fault diagnosis performance. Evaluation on two datasets, CWRU (single sensor) and Real aircraft sensor fault (multiple sensors), demonstrates that the TDANet model significantly outperforms existing deep learning approaches in terms of diagnostic accuracy under noisy environments.
LGSep 29, 2025
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical ReasoningZelin Tan, Hejia Geng, Mulei Zhang et al.
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on 54 experiments across diverse model sizes and training settings, we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: (1). Under a fixed computational budget, larger models trained for fewer steps consistently outperform smaller models trained for more steps. (2). Given a fixed amount of training data, larger models achieve superior sample efficiency, yielding lower loss. (3). In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. (4). These scaling behaviors are robust across both base and instruction-tuned models, which share similar learning dynamics (e.g., larger models show faster convergence) even while differing in absolute accuracy. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
CLFeb 11
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMsZhongzhi Li, Xuansheng Wu, Yijiang Li et al.
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
CLSep 19, 2025
LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMsJunlong Jia, Xing Wu, Chaochen Gao et al.
High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.
LGFeb 20, 2024
Scalable and reliable deep transfer learning for intelligent fault detection via multi-scale neural processes embedded with knowledgeZhongzhi Li, Jingqi Tu, Jiacheng Zhu et al.
Deep transfer learning (DTL) is a fundamental method in the field of Intelligent Fault Detection (IFD). It aims to mitigate the degradation of method performance that arises from the discrepancies in data distribution between training set (source domain) and testing set (target domain). Considering the fact that fault data collection is challenging and certain faults are scarce, DTL-based methods face the limitation of available observable data, which reduces the detection performance of the methods in the target domain. Furthermore, DTL-based methods lack comprehensive uncertainty analysis that is essential for building reliable IFD systems. To address the aforementioned problems, this paper proposes a novel DTL-based method known as Neural Processes-based deep transfer learning with graph convolution network (GTNP). Feature-based transfer strategy of GTNP bridges the data distribution discrepancies of source domain and target domain in high-dimensional space. Both the joint modeling based on global and local latent variables and sparse sampling strategy reduce the demand of observable data in the target domain. The multi-scale uncertainty analysis is obtained by using the distribution characteristics of global and local latent variables. Global analysis of uncertainty enables GTNP to provide quantitative values that reflect the complexity of methods and the difficulty of tasks. Local analysis of uncertainty allows GTNP to model uncertainty (confidence of the fault detection result) at each sample affected by noise and bias. The validation of the proposed method is conducted across 3 IFD tasks, consistently showing the superior detection performance of GTNP compared to the other DTL-based methods.