AIFeb 12Code
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool UseZhen Zhang, Kaiqiang Song, Xun Wang et al.
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
CLAug 12, 2025Code
Complex Logical Instruction GenerationMian Zhang, Shujian Liu, Sixun Dong et al. · microsoft-research
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions becomes increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in the instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
MAOct 22, 2025
Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent CommunicationYiming Lu, Xun Wang, Simin Ma et al.
Teamwork in workspace for complex tasks requires diverse communication strategies, but current multi-agent LLM systems lack systematic frameworks for task oriented communication. We introduce Communication to Completion (C2C), a scalable framework that addresses this gap through two key innovations: (1) the Alignment Factor (AF), a novel metric quantifying agent task alignment that directly impacts work efficiency, and (2) a Sequential Action Framework that integrates stepwise execution with intelligent communication decisions. C2C enables agents to make cost aware communication choices, dynamically improving task understanding through targeted interactions. We evaluated C2C on realistic coding workflows across three complexity tiers and team sizes from 5 to 17 agents, comparing against no communication and fixed steps baselines. The results show that C2C reduces the task completion time by about 40% with acceptable communication costs. The framework completes all tasks successfully in standard configurations and maintains effectiveness at scale. C2C establishes both a theoretical foundation for measuring communication effectiveness in multi-agent systems and a practical framework for complex collaborative tasks.
CLSep 29, 2025
Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert ModelFahim Faisal, Kaiqiang Song, Song Wang et al.
While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
CLSep 26, 2025
Self-Speculative Biased Decoding for Faster Live TranslationLinxiao Zeng, Haoyun Deng, Kangyuan Shu et al.
Large Language Models (LLMs) have recently demonstrated impressive capabilities in various text generation tasks. However, it remains challenging to use them off-the-shelf in streaming applications (such as live translation), where the output must continually update as the input context expands, while still maintaining a reasonable computational cost to meet the latency requirement. In this work, we reexamine the re-translation approach to simultaneous translation and propose Self-Speculative Biased Decoding, a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We propose using the most recent output as a draft for the current growing input context. During the verification stage, the output will be biased towards the draft token for a higher draft acceptance rate. This strategy not only minimizes flickering that might distract users but also leads to higher speedups. Conventional decoding may take charge from the point of divergence after draft verification and continue until the end condition is met. Unlike existing speculative decoding strategies, our approach eliminates the need for draft computations, making it a model-agnostic and plug-and-play solution for accelerating latency-sensitive streaming applications. Experimental results on simultaneous text-to-text re-translation demonstrate that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality. Additionally, it significantly reduces flickering by 80% by incorporating the display-only mask-k technique.