AIAug 2, 2024Code
On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty AgentsJen-tse Huang, Jiaxu Zhou, Tailin Jin et al. · allen-ai, cmu
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, the impact of clumsy or even malicious agents--those who frequently make errors in their tasks--on the overall performance of the system remains underexplored. This paper investigates: (1) What is the resilience of various system structures (e.g., A$\rightarrow$B$\rightarrow$C, A$\leftrightarrow$B$\leftrightarrow$C) under faulty agents, on different downstream tasks? (2) How can we increase system resilience to defend against these agents? To simulate faulty agents, we propose two approaches--AutoTransform and AutoInject--which introduce mistakes into the agents' responses. Experiments on four downstream tasks using six systems show that the "hierarchical" structure, i.e., A$\rightarrow$(B$\leftrightarrow$C), exhibits superior resilience with the lowest performance drop of 5.5%, compared to 10.5% and 23.7% of other two structures. To further improve resilience, we introduce (1) Challenger, that introduces a mechanism for each agent to challenge others' outputs, and (2) Inspector, an additional agent to review and correct messages, recovering up to 96.4% errors made by faulty agents. Our code and data are available at https://github.com/CUHK-ARISE/MAS-Resilience.
LGNov 10, 2022
What Makes a Good Explanation?: A Harmonized View of Properties of ExplanationsZixi Chen, Varshini Subhash, Marton Havasi et al.
Interpretability provides a means for humans to verify aspects of machine learning (ML) models and empower human+ML teaming in situations where the task cannot be fully automated. Different contexts require explanations with different properties. For example, the kind of explanation required to determine if an early cardiac arrest warning system is ready to be integrated into a care setting is very different from the type of explanation required for a loan applicant to help determine the actions they might need to take to make their application successful. Unfortunately, there is a lack of standardization when it comes to properties of explanations: different papers may use the same term to mean different quantities, and different terms to mean the same quantity. This lack of a standardized terminology and categorization of the properties of ML explanations prevents us from both rigorously comparing interpretable machine learning methods and identifying what properties are needed in what contexts. In this work, we survey properties defined in interpretable machine learning papers, synthesize them based on what they actually measure, and describe the trade-offs between different formulations of these properties. In doing so, we enable more informed selection of task-appropriate formulations of explanation properties as well as standardization for future work in interpretable machine learning.
CVFeb 8, 2023Code
Mask Conditional Synthetic Satellite ImageryVan Anh Le, Varshini Reddy, Zixi Chen et al.
In this paper we propose a mask-conditional synthetic image generation model for creating synthetic satellite imagery datasets. Given a dataset of real high-resolution images and accompanying land cover masks, we show that it is possible to train an upstream conditional synthetic imagery generator, use that generator to create synthetic imagery with the land cover masks, then train a downstream model on the synthetic imagery and land cover masks that achieves similar test performance to a model that was trained with the real imagery. Further, we find that incorporating a mixture of real and synthetic imagery acts as a data augmentation method, producing better models than using only real imagery (0.5834 vs. 0.5235 mIoU). Finally, we find that encouraging diversity of outputs in the upstream model is a necessary component for improved downstream task performance. We have released code for reproducing our work on GitHub, see https://github.com/ms-synthetic-satellite-image/synthetic-satellite-imagery .
74.8CEMay 18Code
Robust Restless Multi-Armed Bandit for Data Center Flexibility Services Through Virtual Machine SchedulingYifu Ding, Zixi Chen, Thomas Magnanti
Energy demands from data centers have surged and stressed the grid in recent years. Electric grids require balancing supply and demand every second, motivating demand response (reduction) from large loads, including data centers. This can be achieved by rescheduling jobs on physical machines. Its real-time implementation is uncertain due to fluctuating resource utilization, and rescheduling incurs quality-of-service (QoS) losses that providers are unwilling to disclose. We propose a restless multi-arm bandit (RMAB) framework in which the grid operator requests load reductions without access to detailed job-rescheduling procedures. Using the open-source virtual machine (VM) datasets, we model job arrivals and rescheduling at each data center as a restless arm in a Markov decision process (MDP), and derive Whittle-index-based policies based on the learned transition function via Thompson sampling. To overcome the weakness of an increasingly long learning process due to an enlarged state space, we used a mixed strategy that included a global upper confidence bound (UCB) encoded with trust indices to enhance robustness and accelerate learning. Results show that the proposed mixed-strategy algorithm remains robust across varying state-space sizes and consistently outperforms the pure Thompson-Whittle (TW) algorithm, especially when contextual information is noisy. It also demonstrates superior performance compared to the state-of-the-art EXP4 framework. We provided an open-sourced code for reproducibility.
CLFeb 6Code
PlugMem: A Task-Agnostic Plugin Memory Module for LLM AgentsKe Yang, Zixi Chen, Xuan He et al.
Long-term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task-specific and non-transferable, or task-agnostic but less effective due to low task-relevance and context explosion from raw memory retrieval. We propose PlugMem, a task-agnostic plugin memory module that can be attached to arbitrary LLM agents without task-specific redesign. Motivated by the fact that decision-relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge-centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task-relevant knowledge, rather than verbose raw trajectories, and departs from other graph-based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long-horizon conversational question answering, multi-hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task-agnostic baselines and exceeds task-specific memory designs, while also achieving the highest information density under a unified information-theoretic analysis. Code and data are available at https://github.com/TIMAN-group/PlugMem.
AIFeb 2Code
ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project DevelopmentPengrui Lu, Shiqi Zhang, Yunzhong Hou et al.
Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.
88.1DCMay 7
Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at ScaleTianci Bu, Yuan Lyu, Zixi Chen et al.
Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100\,ms decode budget over hundreds of waiting requests and tens of workers. We present \textbf{BalanceRoute}, a family of practical online routing algorithms that target this bottleneck. The first, \textbf{BR-0}, requires no prediction infrastructure and uses a piecewise-linear F-score that captures the sharp asymmetry between admissions that fill safe margin and those that overflow into the envelope; a two-stage decomposition keeps per-step cost compatible with millisecond-scale scheduling. The second, \textbf{BR-H}, generalizes BR-0 with a short, constant lookahead $H$ and a lightweight termination-classifier interface, extending the F-score to a horizon-discounted form. We deploy BalanceRoute on a 144-NPU cluster and evaluate against vLLM baselines on both a proprietary production trace and the public Azure-2024 trace. Across both workloads, BalanceRoute substantially reduces average DP imbalance and improves end-to-end serving throughput.
LGJan 29
Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM ServingChendong Song, Meixuan Wang, Hang Zhou et al.
Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary-token context grows and requests are continuously replenished with random lengths-while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.
CLJan 14
OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on theWebZeqiang Wang, Xinyue Wu, Chenxi Li et al.
Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update $ΔW = BA^\top$ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form $ΔW = BΣA^\top$ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.
56.5ROApr 3
A Flow Matching Framework for Soft-Robot Inverse DynamicsHang Yang, Fangju Yang, Yangming Zhang et al.
Learning the inverse dynamics of soft continuum robots remains challenging due to high-dimensional nonlinearities and complex actuation coupling. Conventional feedback-based controllers often suffer from control chattering due to corrective oscillations, while deterministic regression-based learners struggle to capture the complex nonlinear mappings required for accurate dynamic tracking. Motivated by these limitations, we propose an inverse-dynamics framework for open-loop feedforward control that learns the system's differential dynamics as a generative transport map. Specifically, inverse dynamics is reformulated as a conditional flow-matching problem, and Rectified Flow (RF) is adopted as a lightweight instance to generate physically consistent control inputs rather than conditional averages. Two variants are introduced to further enhance physical consistency: RF-Physical, utilizing a physics-based prior for residual modeling; and RF-FWD, integrating a forward-dynamics consistency loss during flow matching. Extensive evaluations demonstrate that our framework reduces trajectory tracking RMSE by over 50% compared to standard regression baselines (MLP, LSTM, Transformer). The system sustains stable open-loop execution at a peak end-effector velocity of 1.14 m/s with sub-millisecond inference latency (0.995 ms). This work demonstrates flow matching as a robust, high-performance paradigm for learning differential inverse dynamics in soft robotic systems.
OCSep 14, 2025
Convergence Rate in Nonlinear Two-Time-Scale Stochastic Approximation with State (Time)-DependenceZixi Chen, Yumin Xu, Ruixun Zhang
The nonlinear two-time-scale stochastic approximation is widely studied under conditions of bounded variances in noise. Motivated by recent advances that allow for variability linked to the current state or time, we consider state- and time-dependent noises. We show that the Lyapunov function exhibits polynomial convergence rates in both cases, with the rate of polynomial delay depending on the parameters of state- or time-dependent noises. Notably, if the state noise parameters fully approach their limiting value, the Lyapunov function achieves an exponential convergence rate. We provide two numerical examples to illustrate our theoretical findings in the context of stochastic gradient descent with Polyak-Ruppert averaging and stochastic bilevel optimization.
LGDec 5, 2025
Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across ScalesShikai Qiu, Zixi Chen, Hoang Phan et al.
Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results. To better understand the effectiveness of these optimizers at scale, in this work we investigate how to scale preconditioned optimizers via hyperparameter transfer, building on prior works such as $μ$P. We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to $μ$P improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, we show Muon, SOAP and Shampoo consistently achieve near $1.4\times$ speedup over AdamW for training Llama-architecture language models of sizes ranging from $190$M to $1.4$B, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, we argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.
LGAug 20, 2025
Adaptively Robust LLM Inference Optimization under Prediction UncertaintyZixi Chen, Yinyu Ye, Zijie Zhou
We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online and multi-task service process and also heavily energy consuming by which a pre-trained LLM processes input requests and generates output tokens sequentially. Therefore, it is vital to improve its scheduling efficiency and reduce the power consumption while a great amount of prompt requests are arriving. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{\max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inferencing. We prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval--an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.
ROMar 12, 2025
Tacchi 2.0: A Low Computational Cost and Comprehensive Dynamic Contact Simulator for Vision-based Tactile SensorsYuhao Sun, Shixin Zhang, Wenzhuang Li et al.
With the development of robotics technology, some tactile sensors, such as vision-based sensors, have been applied to contact-rich robotics tasks. However, the durability of vision-based tactile sensors significantly increases the cost of tactile information acquisition. Utilizing simulation to generate tactile data has emerged as a reliable approach to address this issue. While data-driven methods for tactile data generation lack robustness, finite element methods (FEM) based approaches require significant computational costs. To address these issues, we integrated a pinhole camera model into the low computational cost vision-based tactile simulator Tacchi that used the Material Point Method (MPM) as the simulated method, completing the simulation of marker motion images. We upgraded Tacchi and introduced Tacchi 2.0. This simulator can simulate tactile images, marked motion images, and joint images under different motion states like pressing, slipping, and rotating. Experimental results demonstrate the reliability of our method and its robustness across various vision-based tactile sensors.
HCDec 12, 2024
Predicting Quality of Video Gaming Experience Using Global-Scale Telemetry Data and Federated LearningZhongyang Zhang, Jinhe Wen, Zixi Chen et al.
Frames Per Second (FPS) significantly affects the gaming experience. Providing players with accurate FPS estimates prior to purchase benefits both players and game developers. However, we have a limited understanding of how to predict a game's FPS performance on a specific device. In this paper, we first conduct a comprehensive analysis of a wide range of factors that may affect game FPS on a global-scale dataset to identify the determinants of FPS. This includes player-side and game-side characteristics, as well as country-level socio-economic statistics. Furthermore, recognizing that accurate FPS predictions require extensive user data, which raises privacy concerns, we propose a federated learning-based model to ensure user privacy. Each player and game is assigned a unique learnable knowledge kernel that gradually extracts latent features for improved accuracy. We also introduce a novel training and prediction scheme that allows these kernels to be dynamically plug-and-play, effectively addressing cold start issues. To train this model with minimal bias, we collected a large telemetry dataset from 224 countries and regions, 100,000 users, and 835 games. Our model achieved a mean Wasserstein distance of 0.469 between predicted and ground truth FPS distributions, outperforming all baseline methods.