95.2CVMay 28Code
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI AgentsTianpeng Bu, Xin Liu, Qihua Chen et al.
While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis. GUI-RobustEval contains $1,216$ executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates $800k$ high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a $47.4\%$ success rate and a $33.8\%$ All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
98.7LGMay 27Code
Long Live The Balance: Information Bottleneck Driven Tree-based Policy OptimizationHao Jiang, Shurui Li, Tianpeng Bu et al.
Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training with suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
CVFeb 15, 2023Code
Efficient Teacher: Semi-Supervised Object Detection for YOLOv5Bowen Xu, Mingtao Chen, Wenlong Guan et al.
Semi-Supervised Object Detection (SSOD) has been successful in improving the performance of both R-CNN series and anchor-free detectors. However, one-stage anchor-based detectors lack the structure to generate high-quality or flexible pseudo labels, leading to serious inconsistency problems in SSOD. In this paper, we propose the Efficient Teacher framework for scalable and effective one-stage anchor-based SSOD training, consisting of Dense Detector, Pseudo Label Assigner, and Epoch Adaptor. Dense Detector is a baseline model that extends RetinaNet with dense sampling techniques inspired by YOLOv5. The Efficient Teacher framework introduces a novel pseudo label assignment mechanism, named Pseudo Label Assigner, which makes more refined use of pseudo labels from Dense Detector. Epoch Adaptor is a method that enables a stable and efficient end-to-end semi-supervised training schedule for Dense Detector. The Pseudo Label Assigner prevents the occurrence of bias caused by a large number of low-quality pseudo labels that may interfere with the Dense Detector during the student-teacher mutual learning mechanism, and the Epoch Adaptor utilizes domain and distribution adaptation to allow Dense Detector to learn globally distributed consistent features, making the training independent of the proportion of labeled data. Our experiments show that the Efficient Teacher framework achieves state-of-the-art results on VOC, COCO-standard, and COCO-additional using fewer FLOPs than previous methods. To the best of our knowledge, this is the first attempt to apply Semi-Supervised Object Detection to YOLOv5.Code is available: https://github.com/AlibabaResearch/efficientteacher
CLFeb 2Code
D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool UseBowen Xu, Shaoyu Wu, Hao Jiang et al.
Effective tool use and reasoning are essential capabilities for large reasoning models~(LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework D-CORE~(\underline{\textbf{D}}ecomposing tasks and \underline{\textbf{Co}}mposing \underline{\textbf{Re}}asoning processes) that first incentivize the LRMs' task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning~(RL) to restore LRMs' reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7\% accuracy, surpassing the best-performing 8B model by 5.7\%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3\%, outperforming 70B models despite being 5$\times$ smaller. The source code is available at https://github.com/alibaba/EfficientAI.
CLApr 29, 2024Code
Mixture-of-Instructions: Aligning Large Language Models via Mixture PromptingBowen Xu, Shaoyu Wu, Kai Liu et al.
With the proliferation of large language models (LLMs), the comprehensive alignment of such models across multiple tasks has emerged as a critical area of research. Existing alignment methodologies primarily address single task, such as multi-turn dialogue, coding, mathematical problem-solving, and tool usage. Although there is a large amount of high-quality data available for those tasks, most of them provide only questions and answers without including the system prompt. Though a detailed analysis of the Qwen language model, we found that the system prompt has a significant impact on both training and inference processes of LLM. We attributes this phenomenon to overfitting to the system prompt. In address this issue, we introduce a novel technique termed Mixture-of-Instructions (MoI), which employs a strategy of instruction packing combined with diverse system prompts to boost the alignment efficiency of language models. We have also compiled a diverse set of seven benchmark datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced language model. Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI. This enhanced model demonstrates significant advancements in generative capabilities across coding, mathematics, and tool use tasks.
CVMar 5Code
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language ModelsLulu Hu, Wenhu Xiao, Xin Chen et al.
Post-training quantization (PTQ) with computational invariance for Large Language Models~(LLMs) have demonstrated remarkable advances, however, their application to Multimodal Large Language Models~(MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
CLJul 2, 2025
La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse ActivationKai Liu, Bowen Xu, Shaoyu Wu et al.
Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.