AIMay 28
LsrIF: Enhancing Logic-Structured Instruction Following of Large Language ModelsQingyu Ren, Qianyu He, Jingwen Chang et al.
Instruction following is critical for large language models, yet real-world instructions often involve multiple constraints with logical structures, such as parallel composition, sequential dependencies, and conditional branching. Existing methods typically construct data by simply combining constraints and aggregate rewards by averaging individual constraint scores during training, overlooking logical dependencies and introducing noisy signals. We propose LsrIF, a training framework for logic-structured instruction following. LsrIF constructs data by organizing atomic constraints into parallel, sequential, conditional, and nested structures, and applies structure-aware reward aggregation aligned with their execution semantics: averaging rewards for parallel constraints, decaying later rewards after early failures in sequential structures, and rewarding only active branches in conditional structures. Experiments show that LsrIF improves instruction following in both in-domain and out-of-domain settings while also benefiting logic reasoning. Further analysis indicates that logic-structured training increases attention to constraint-related tokens and logical connectors, suggesting improved modeling of instruction logic. We will release our data and code for future research.
CLMay 8Code
SEIF: Self-Evolving Reinforcement Learning for Instruction FollowingQingyu Ren, Qianyu He, Jiajie Zhu et al.
Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.
CLFeb 24, 2025Code
Order Matters: Investigate the Position Bias in Multi-constraint Instruction FollowingJie Zeng, Qianyu He, Qingyu Ren et al.
Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). An observation is that the LLMs exhibit dramatic performance fluctuation when disturbing the order of the incorporated constraints. Yet, none of the existing works has systematically investigated this position bias problem in the field of multi-constraint instruction following. To bridge this gap, we design a probing task where we quantitatively measure the difficulty distribution of the constraints by a novel Difficulty Distribution Index (CDDI). Through the experimental results, we find that LLMs are more performant when presented with the constraints in a ``hard-to-easy'' order. This preference can be generalized to LLMs with different architecture or different sizes of parameters. Additionally, we conduct an explanation study, providing an intuitive insight into the correlation between the LLM's attention and constraint orders. Our code and dataset are publicly available at https://github.com/meowpass/PBIF.
CLJan 9, 2025Code
Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language ModelsQingyu Ren, Jie Zeng, Qianyu He et al.
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. However, it is an unexplored area to enhance LLMs' ability to follow soft constraints. To bridge the gap, we initially design a pipeline to construct datasets with high-quality outputs automatically. Additionally, to fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. Furthermore, taking into account the difficulty of soft constraints indicated by the number of constraints, we design a curriculum learning training paradigm based on the constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs' soft constraint following ability and analyze the factors driving the improvements.The datasets and code are publicly available at https://github.com/Rainier-rq/FollowSoftConstraint.
AIAug 19, 2024
Minor DPO reject penalty to increase training robustnessShiming Xie, Hong Chen, Fred Yu et al.
Learning from human preference is a paradigm used in large-scale language model (LLM) fine-tuning step to better align pretrained LLM to human preference for downstream task. In the past it uses reinforcement learning from human feedback (RLHF) algorithm to optimize the LLM policy to align with these preferences and not to draft too far from the original model. Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method. Using preference pairs of chosen and reject data, DPO models the relative log probability as implicit reward function and optimize LLM policy using a simple binary cross entropy objective directly. DPO is quite straight forward and easy to be understood. It perform efficiently and well in most cases. In this article, we analyze the working mechanism of $β$ in DPO, disclose its syntax difference between RL algorithm and DPO, and understand the potential shortage brought by the DPO simplification. With these insights, we propose MinorDPO, which is better aligned to the original RL algorithm, and increase the stability of preference optimization process.
CLOct 16, 2025Code
Instructions are all you need: Self-supervised Reinforcement Learning for Instruction FollowingQingyu Ren, Qianyu He, Bowei Zhang et al.
Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
AISep 27, 2025Code
Your Models Have Thought Enough: Training Large Reasoning Models to Stop OverthinkingJinyi Han, Ying Huang, Ying Liao et al.
Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning path during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. Besides, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Especially, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available in the GitHub.
CLAug 18, 2025Code
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language ModelsJinyi Han, Xinyi Wang, Haiquan Zhao et al.
Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6% compared to standard generation, while also achieving an 8.2% improvement in accuracy. Our code and baselines used in the paper are available on GitHub.
AIAug 4, 2025Code
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction FollowingQingyu Ren, Qianyu He, Bowei Zhang et al.
Reasoning models excel in complex problem solving but exhibit a concerning trade off between reasoning capabilities and instruction following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
CLFeb 27, 2025Code
Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric AugmentationQianxi He, Qianyu He, Jiaqing Liang et al.
Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations. LLMs often rely on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs' reasoning performance and adaptability to diverse logical structures. We release our codes and augmented data in https://github.com/qianxiHe147/Order-Centric-Data-Augmentation.
AIAug 20, 2024
Minor SFT loss for LLM fine-tune to increase performance and reduce model deviationShiming Xie, Hong Chen, Fred Yu et al.
Instruct LLM provide a paradigm used in large scale language model to align LLM to human preference. The paradigm contains supervised fine tuning and reinforce learning from human feedback. This paradigm is also used in downstream scenarios to adapt LLM to specific corpora and applications. Comparing to SFT, there are many efforts focused on RLHF and several algorithms being proposed, such as PPO, DPO, IPO, KTO, MinorDPO and etc. Meanwhile most efforts for SFT are focused on how to collect, filter and mix high quality data. In this article with insight from DPO and MinorDPO, we propose a training metric for SFT to measure the discrepancy between the optimized model and the original model, and a loss function MinorSFT that can increase the training effectiveness, and reduce the discrepancy between the optimized LLM and original LLM.