AI · Jul 25, 2024
Enhancing Agent Learning through World Dynamics Modeling
Zhiyuan Sun, Haochen Shi, Marc-Alexandre Côté et al. · microsoft-research
Large language models (LLMs) have been increasingly applied to tasks in language understanding and interactive decision-making, with their impressive performance largely attributed to the extensive domain knowledge embedded within them. However, the depth and breadth of this knowledge can vary across domains. Many existing approaches assume that LLMs possess a comprehensive understanding of their environment, often overlooking potential gaps in their grasp of actual world dynamics. To address this, we introduce Discover, Verify, and Evolve (DiVE), a framework that discovers world dynamics from a small number of demonstrations, verifies the accuracy of these dynamics, and evolves new, advanced dynamics tailored to the current situation. Through extensive evaluations, we assess the impact of each component on performance and compare the dynamics generated by DiVE to human-annotated dynamics. Our results show that LLMs guided by DiVE make more informed decisions, achieving rewards comparable to human players in the Crafter environment and surpassing methods that require prior task-specific training in the MiniHack environment.
NA · Mar 1, 2018
An Arbitrary-Order Discontinuous Galerkin Method with One Unknown Per Element
Ruo Li, Pingbing Ming, Zhiyuan Sun et al.
We propose an arbitrary-order discontinuous Galerkin method for second-order elliptic problems on general polygonal meshes with only one degree of freedom per element. This is achieved by locally solving a discrete least-squares problem over a patch of neighboring elements. Under a geometrical condition on the element patch, we prove optimal a priori error estimates in the energy norm and in the L$^2$ norm. The accuracy and efficiency of the method up to order six on several polygonal meshes are illustrated by a set of benchmark problems.
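The local least-squares step described in this abstract can be illustrated in one dimension. The following is a minimal sketch, not the authors' implementation: it assumes each element carries a single value sampled at its center, and the function name `reconstruct_on_patch` and the 1D uniform mesh are hypothetical choices made for illustration only.

```python
import numpy as np

def reconstruct_on_patch(centers, values, x0, order):
    """Fit a polynomial of the given order to one value per element on a patch.

    centers: element centers in the patch (assumed sampling points)
    values:  the single unknown carried by each element in the patch
    x0:      center of the element being reconstructed
    Returns coefficients c with p(x) = sum_k c[k] * (x - x0)**k.
    """
    centers = np.asarray(centers, dtype=float)
    values = np.asarray(values, dtype=float)
    if centers.size < order + 1:
        raise ValueError("patch too small for the requested order")
    # Vandermonde matrix in the shifted variable (x - x0).
    A = np.vander(centers - x0, N=order + 1, increasing=True)
    # Discrete least-squares fit over the patch.
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    return coeffs

# Patch of five elements on a uniform 1D mesh, sampled from u(x) = 1 + 2x + 3x^2.
centers = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
values = 1 + 2 * centers + 3 * centers**2
coeffs = reconstruct_on_patch(centers, values, x0=0.5, order=2)
```

Because the sampled function is itself quadratic, the fit recovers its Taylor coefficients about `x0 = 0.5`. The patch must contain at least `order + 1` elements, which is a simple 1D stand-in for the geometrical condition on the patch mentioned in the abstract.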
NA · Jun 4, 2018
A Discontinuous Galerkin Method by Patch Reconstruction for Biharmonic Problem
Ruo Li, Pingbing Ming, Zhiyuan Sun et al.
We propose a new discontinuous Galerkin method based on least-squares patch reconstruction for the biharmonic problem. We prove an optimal error estimate for the proposed method. Two-dimensional and three-dimensional numerical examples are presented to confirm the accuracy and efficiency of the method with several boundary conditions and several types of polygonal and polyhedral meshes.
CL · Feb 3 · Code
TRE: Encouraging Exploration in the Trust Region
Chao Huang, Yujing Lu, Quangang Li et al.
Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region.
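To make the contrast between global entropy and entropy restricted to plausible tokens concrete, here is a minimal sketch. The abstract does not define the trust region, so this example assumes, purely for illustration, that it is the set of the `k` most likely tokens; `top_k_entropy` is a hypothetical helper and not the TRE objective itself.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def top_k_entropy(logits, k):
    """Entropy of the distribution renormalized over the k most likely tokens.

    Assumption: the 'trust region' is approximated by the top-k set.
    """
    p = softmax(np.asarray(logits, dtype=float))
    top = np.sort(p)[-k:]
    return entropy(top / top.sum())

# A flat distribution over a 1000-token vocabulary.
logits = np.zeros(1000)
global_h = entropy(softmax(logits))          # grows with vocabulary size
restricted_h = top_k_entropy(logits, k=10)   # bounded by log(k)
```

The restricted entropy is bounded by `log(k)` no matter how large the vocabulary is, whereas the global entropy can keep growing by spreading mass over the tail, which is the failure mode the abstract describes.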
93.4 · AI · May 27
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Guoxin Ma, Yibing Liu, Chengzhengxu Li et al.
Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, and already outperforms most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), which leverages a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.
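The metrics quoted in this abstract (compression ratio, token-level F1, Exact Match) can be sketched as follows. This is a simplified illustration under stated assumptions: answers are normalized only by lowercasing and whitespace splitting, whereas benchmark scripts usually also strip punctuation and articles. The function names are hypothetical and do not come from the paper.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the answers agree after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def compression_ratio(original_tokens, compressed_tokens):
    """For example, 8000 input tokens compressed to 1000 gives a ratio of 8x."""
    return original_tokens / compressed_tokens

f1 = token_f1("the cat sat", "the cat")
em = exact_match("Paris ", "paris")
ratio = compression_ratio(8000, 1000)
```

In the first call, two of three predicted tokens match and both reference tokens are covered, so precision is 2/3 and recall is 1.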
NA · Jan 22, 2019
A finite element method by patch reconstruction for the Stokes problem using mixed formulations
Ruo Li, Zhiyuan Sun, Fanyi Yang et al.
In this paper, we develop a patch reconstruction finite element method for the Stokes problem. The weak formulation of the interior penalty discontinuous Galerkin method is employed. The proposed method offers great flexibility in the choice of velocity-pressure space pairs, whose stability properties are confirmed by inf-sup tests. Numerical examples show the applicability and efficiency of the proposed method.
NA · Dec 12, 2018
A Discontinuous Galerkin Method for the Stokes Equation by Divergence-free Patch Reconstruction
Ruo Li, Zhiyuan Sun, Zhijian Yang
A discontinuous Galerkin method by patch reconstruction is proposed for Stokes flows. A locally divergence-free reconstruction space is employed as the approximation space, and the interior penalty method is adopted, with penalty terms on the normal component that cancel out the pressure term. Thanks to this weak form, the Stokes equation can be solved as an elliptic system instead of a saddle-point problem. The number of degrees of freedom of our method equals the number of elements in the mesh, regardless of the order of accuracy. Error estimates for the proposed method are given in a classical style and are then verified by numerical examples.
58.6 · AI · Apr 8
ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
Yu Liang, Liangxin Liu, Longzheng Wang et al.
Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and allocates fine-grained and differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.
69.4 · AI · Apr 8
ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework
Kai Qin, Liangxin Liu, Yu Liang et al.
Reward Models (RMs) are critical components in the Reinforcement Learning from Human Feedback (RLHF) pipeline, directly determining the alignment quality of Large Language Models (LLMs). Recently, Generative Reward Models (GRMs) have emerged as a superior paradigm, offering higher interpretability and stronger generalization than traditional scalar RMs. However, existing methods for GRMs focus primarily on outcome-level supervision, neglecting analytical process quality, which constrains their potential. To address this, we propose ReflectRM, a novel GRM that leverages self-reflection to assess analytical quality and enhance preference modeling. ReflectRM is trained under a unified generative framework for joint modeling of response preference and analysis preference. During inference, we use its self-reflection capability to identify the most reliable analysis, from which the final preference prediction is derived. Experiments across four benchmarks show that ReflectRM consistently improves performance, achieving an average accuracy gain of +3.7 on Qwen3-4B. Further experiments confirm that response preference and analysis preference are mutually reinforcing. Notably, ReflectRM substantially mitigates positional bias, yielding +10.2 improvement compared with leading GRMs and establishing itself as a more stable evaluator.
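Positional bias in a pairwise judge is commonly probed by presenting the same two responses in both orders and checking whether the verdict follows the response rather than the slot. The sketch below illustrates that check only; the `judge(prompt, first, second)` interface returning `"first"` or `"second"` is a hypothetical stand-in, and the abstract does not specify how its +10.2 figure is computed.

```python
def position_consistency(judge, pairs):
    """Fraction of pairs whose preferred response is the same in both orders.

    judge(prompt, first, second) -> "first" or "second" (assumed interface).
    pairs: iterable of (prompt, response_a, response_b) tuples.
    """
    pairs = list(pairs)
    consistent = 0
    for prompt, a, b in pairs:
        picks_a_forward = judge(prompt, a, b) == "first"    # a shown first
        picks_a_backward = judge(prompt, b, a) == "second"  # a shown second
        consistent += picks_a_forward == picks_a_backward
    return consistent / len(pairs)

# Two toy judges: one always prefers the first slot, one prefers the longer text.
def slot_biased_judge(prompt, first, second):
    return "first"

def length_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"

pairs = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
biased_score = position_consistency(slot_biased_judge, pairs)
content_score = position_consistency(length_judge, pairs)
```

A judge that always picks the first slot scores 0.0, while a judge whose decision depends only on content scores 1.0.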
96.9 · NA · Mar 31
A Thermodynamically Consistent High-Order Framework for Staggered Lagrangian Hydrodynamics
Zhiyuan Sun, Jun Liu, Pei Wang
We present a consistent high-order staggered Lagrangian hydrodynamics framework designed to reconcile an underlying disparity in existing curvilinear formulations: the mismatch between quadrature-based "strong" mass conservation and the discrete degrees of freedom (DOFs) of thermodynamic variables. By mathematically coupling the numerical quadrature rule with the density representation, our approach ensures rigorous point-wise consistency between density, internal energy, and pressure. This synchronization eliminates the ambiguity of equation-of-state (EOS) updates inherent in previous high-order staggered methods. To stabilize the discretization, we develop a high-order generalization of the subzonal pressure method by conceptually enriching the pressure field from the $Q^{m-1}$ to the $Q^m$ finite element space. We prove that evaluating this enriched field using a high-order quadrature rule naturally generates a restorative anti-hourglass force, which exactly recovers the classical $Q^1-P^0$ compatible hydrodynamics algorithm as a limiting case for $m=1$. Furthermore, we introduce a concise, algorithmic formulation of tensor artificial viscosity that streamlines implementation and significantly reduces computational overhead in high-order settings. The resulting framework yields strictly diagonal mass matrices for both momentum and energy equations, enabling highly efficient, fully explicit time integration without global linear solves. Extensive numerical benchmarks, including smooth convergence tests and complex shock-dominated flows, demonstrate that the proposed method achieves optimal high-order accuracy while maintaining superior geometric robustness.
AI · Mar 5, 2024
OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following
Haochen Shi, Zhiyuan Sun, Xingdi Yuan et al. · microsoft-research
Embodied Instruction Following (EIF) is a crucial task in embodied learning, requiring agents to interact with their environment through egocentric observations to fulfill natural language instructions. Recent advancements have seen a surge in employing large language models (LLMs) within a framework-centric approach to enhance performance in embodied learning tasks, including EIF. Despite these efforts, a unified understanding of how the various components, ranging from visual perception to action execution, affect task performance is still lacking. To address this gap, we introduce OPEx, a comprehensive framework that delineates the core components essential for solving embodied learning tasks: Observer, Planner, and Executor. Through extensive evaluations, we provide a deep analysis of how each component influences EIF task performance. Furthermore, we innovate within this space by deploying a multi-agent dialogue strategy on a TextWorld counterpart, further enhancing task performance. Our findings reveal that LLM-centric design markedly improves EIF outcomes, identify visual perception and low-level action execution as critical bottlenecks, and demonstrate that augmenting LLMs with a multi-agent framework further elevates performance.
70.2 · SE · Apr 2
Automated Functional Testing for Malleable Mobile Application Driven from User Intent
Yuying Wang, Kaifeng Huang, Hao Deng et al.
Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose \tool, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that \tool effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.
CL · Feb 2
Advancing General-Purpose Reasoning Models with Modular Gradient Surgery
Min Cai, Yu Liang, Longzheng Wang et al.
Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
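Resolving gradient conflicts module by module can be sketched with a projection rule. The abstract does not state the exact rule MGS uses, so this example swaps in a PCGrad-style projection as an assumption: when two domain gradients for the same module have a negative inner product, each is projected onto the normal plane of the other before summing. The module names and the two-domain setup are hypothetical.

```python
import numpy as np

def project_conflict(g, other):
    """Remove from g its component along `other` when the two conflict."""
    dot = float(np.dot(g, other))
    if dot >= 0.0:
        return g  # no conflict: leave the gradient untouched
    return g - dot / float(np.dot(other, other)) * other

def modular_surgery(grads_a, grads_b):
    """Combine two domains' gradients, resolving conflicts per module.

    grads_a, grads_b: dicts mapping a module name to a flat gradient vector.
    """
    merged = {}
    for name in grads_a:
        ga, gb = grads_a[name], grads_b[name]
        merged[name] = project_conflict(ga, gb) + project_conflict(gb, ga)
    return merged

# Hypothetical gradients from two domains for two transformer modules.
grads_math = {"attn": np.array([1.0, 0.0]), "mlp": np.array([1.0, 1.0])}
grads_chat = {"attn": np.array([-1.0, 1.0]), "mlp": np.array([2.0, 0.0])}
merged = modular_surgery(grads_math, grads_chat)
```

Here the `attn` gradients conflict and are projected before summing, while the `mlp` gradients agree and are simply added. Deciding conflicts per module rather than on the whole flattened gradient is the point of the illustration.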
LG · Feb 10
When Less is More: The LLM Scaling Paradox in Context Compression
Ruishan Guo, Yibing Liu, Guoxin Ma et al.
Scaling up model parameters has long been a prevalent training paradigm driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we observe a Size-Fidelity Paradox: increasing the compressor size can reduce the faithfulness of reconstructed contexts even though training loss decreases. Through extensive experiments across models from 0.6B to 90B, we trace this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., "the white strawberry" $\to$ "the red strawberry"; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., "Alice hit Bob" $\to$ "Bob hit Alice". By holding model size fixed, we reflect on the emergent properties of compressed context representations. We show that the culprit is not parameter count itself, but the excessive semantic capacity and amplified generative uncertainty that accompany scaling. Specifically, the increased rank of context embeddings facilitates prior knowledge intrusion, whereas higher entropy over token prediction distributions promotes rewriting. Our results complement existing evaluations of the context compression paradigm and point to a breakdown in scaling laws for faithful preservation in open-ended generation.
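The two quantities named at the end of this abstract, the rank of context embeddings and the entropy of token prediction distributions, can be measured along the following lines. This is a sketch under assumptions: the abstract does not say which rank measure is used, so the example uses an entropy-based effective rank of the singular values as one common choice, and both function names are hypothetical.

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def effective_rank(embeddings):
    """exp(entropy of normalized singular values): a smooth proxy for rank."""
    s = np.linalg.svd(np.asarray(embeddings, dtype=float), compute_uv=False)
    q = s / s.sum()
    q = q[q > 0]
    return float(np.exp(-(q * np.log(q)).sum()))

# Embeddings spanning three independent directions vs. a single direction.
full = np.eye(3)
collapsed = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
rank_full = effective_rank(full)
rank_collapsed = effective_rank(collapsed)

# A confident prediction vs. an uncertain one.
h_confident = token_entropy([1.0, 0.0])
h_uncertain = token_entropy([0.5, 0.5])
```

Under the abstract's account, higher values of the first measure would accompany prior-knowledge intrusion and higher values of the second would accompany rewriting; the sketch only shows how such measurements could be taken.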
MA · Mar 13, 2025
H2-MARL: Multi-Agent Reinforcement Learning for Pareto Optimality in Hospital Capacity Strain and Human Mobility during Epidemic
Xueting Luo, Hao Deng, Jihong Yang et al.
The necessity of achieving an effective balance between minimizing the losses associated with restricting human mobility and ensuring hospital capacity has gained significant attention in the aftermath of COVID-19. Reinforcement learning (RL)-based strategies for human mobility management have recently advanced in addressing the dynamic evolution of cities and epidemics; however, they still face challenges in achieving coordinated control at the township level and adapting to cities of varying scales. To address the above issues, we propose a multi-agent RL approach that achieves Pareto optimality in managing hospital capacity and human mobility (H2-MARL), applicable across cities of different scales. We first develop a township-level infection model with online-updatable parameters to simulate disease transmission and construct a city-wide dynamic spatiotemporal epidemic simulator. On this basis, H2-MARL is designed to treat each division as an agent, with a trade-off dual-objective reward function formulated and an experience replay buffer enriched with expert knowledge built. To evaluate the effectiveness of the model, we construct a township-level human mobility dataset containing over one billion records from four representative cities of varying scales. Extensive experiments demonstrate that H2-MARL has the optimal dual-objective trade-off capability, which can minimize hospital capacity strain while minimizing human mobility restriction loss. Meanwhile, the applicability of the proposed model to epidemic control in cities of varying scales is verified, which showcases its feasibility and versatility in practical applications.