Zekun Xu

LG
h-index10
14papers
911citations
Novelty58%
AI Score56

14 Papers

LGAug 31, 2025Code
RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use

Jiajun Chai, Guojun Yin, Zekun Xu et al.

Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning post-training framework for multi-round tool use. RLFactory tackles (i) tool-call stability and adaptability amid tool heterogeneity and interface issues via an asyncio-based asynchronous caller and a decoupled tool/training architecture, and (ii) diverse evaluation needs via a reward layer supporting rule-based, model-judgment, and tool-verification signals. It reconstructs the MDP by introducing observation markers from tool feedback, closing the loop among model, tools, and environment, and implements a generate-parse-invoke-update workflow for dynamic policy optimization. On Search-R1 with Qwen3-4B, RLFactory achieves a 0.486 test score on the Natural Questions (NQ) dataset, surpassing larger models trained with similar techniques (e.g., Qwen2.5-7B-Instruct-GRPO at 0.473), and increases training throughput by 6.8x. RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios. Code: https://github.com/Simple-Efficient/RL-Factory.

CLNov 11, 2025
From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

Siyu Xia, Zekun Xu, Jiajun Chai et al.

Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. In order to make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent's training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents' strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.

CVDec 5, 2025Code
Training Multi-Image Vision Agents via End2End Reinforcement Learning

Chengqi Dong, Chuhuai Yue, Hang He et al.

Recent VLM-based agents aim to replicate OpenAI O3's ``thinking with images" via tool use, but most open-source methods limit input to a single image, falling short on real-world multi-image QA tasks. To address this, we propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning dedicated for complex multi-image tasks. By leveraging a multi-agent system, we generate challenging and visually-rich multi-image QA pairs to fully activate the tool-use potential of the base VLM. Through manual verification, we obtain MIFG-QA, comprising 10k samples for training and evaluation. With deeper reasoning steps, VLMs may increasingly ignore visual inputs. We therefore develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content during inference. Benefiting from our well-designed action-trajectory two-level mask strategy, IMAgent achieves stable tool use behavior via pure RL training without requiring costly supervised fine-tuning data. Extensive experiments demonstrate that IMAgent maintains strong performance on existing single-image benchmarks while achieving substantial improvements on our proposed multi-image dataset, with our analysis providing actionable insights for the research community. Codes and data will be released soon.

CVSep 28, 2025Code
AutoPrune: Each Complexity Deserves a Pruning Policy

Hanshi Wang, Yuhao Xu, Zekun Xu et al.

The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.

AIOct 29, 2025
MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL

Zekun Xu, Siyu Xia, Chuhuai Yue et al.

As large language models (LLMs) are increasingly used in Text-to-SQL tasks, Reinforcement Learning (RL) has become a common method for improving performance. Existing methods primarily rely on static execution feedback, which restricts real-time error correction. However, integrating multi-turn tool invocation along with dynamic feedback could significantly improve adaptability and robustness, ultimately enhancing model performance. To address these issues, we propose MTIR-SQL, an innovative Multi-turn Tool-Integrated Reasoning reinforcement learning framework for Text-to-SQL. Our approach introduces an execution-aware multi-turn reasoning paradigm that seamlessly incorporates database execution feedback at each reasoning step, enabling context-sensitive query generation and progressive refinement throughout the reasoning process. The framework extends the GRPO algorithm to accommodate complex multi-turn interaction scenarios. Considering the training instability characteristics of MTIR and the potential for significant Deviation of model distribution from the initial model, we enhance the GRPO algorithm by adding a trajectory filtering mechanism and removing KL loss constraints. Experimental results demonstrate that MTIR-SQL, with 4B parameters, achieves \textbf{64.4}\% accuracy in the BIRD Dev and 84.6% execution accuracy in the SPIDER Dev, significantly outperforming existing approaches.

IRJul 22, 2025
LLM-Enhanced Reranking for Complementary Product Recommendation

Zekun Xu, Yudi Zhang

Complementary product recommendation, which aims to suggest items that are used together to enhance customer value, is a crucial yet challenging task in e-commerce. While existing graph neural network (GNN) approaches have made significant progress in capturing complex product relationships, they often struggle with the accuracy-diversity tradeoff, particularly for long-tail items. This paper introduces a model-agnostic approach that leverages Large Language Models (LLMs) to enhance the reranking of complementary product recommendations. Unlike previous works that use LLMs primarily for data preprocessing and graph augmentation, our method applies LLM-based prompting strategies directly to rerank candidate items retrieved from existing recommendation models, eliminating the need for model retraining. Through extensive experiments on public datasets, we demonstrate that our approach effectively balances accuracy and diversity in complementary product recommendations, with at least 50% lift in accuracy metrics and 2% lift in diversity metrics on average for the top recommended items across datasets.

LGJul 7, 2021
Reconstructing Test Labels from Noisy Loss Functions

Abhinav Aggarwal, Shiva Prasad Kasiviswanathan, Zekun Xu et al.

Machine learning classifiers rely on loss functions for performance evaluation, often on a private (hidden) dataset. In a recent line of research, label inference was introduced as the problem of reconstructing the ground truth labels of this private dataset from just the (possibly perturbed) cross-entropy loss function values evaluated at chosen prediction vectors (without any other access to the hidden dataset). In this paper, we formally study the necessary and sufficient conditions under which label inference is possible from \emph{any} (noisy) loss function value. Using tools from analytical number theory, we show that a broad class of commonly used loss functions, including general Bregman divergence-based losses and multiclass cross-entropy with common activation functions like sigmoid and softmax, it is possible to design label inference attacks that succeed even for arbitrary noise levels and using only a single query from the adversary. We formally study the computational complexity of label inference and show that while in general, designing adversarial prediction vectors for these attacks is co-NP-hard, once we have these vectors, the attacks can also be carried out through a lightweight augmentation to any neural network model, making them look benign and hard to detect. The observations in this paper provide a deeper understanding of the vulnerabilities inherent in modern machine learning and could be used for designing future trustworthy ML.

LGMay 18, 2021
Label Inference Attacks from Log-loss Scores

Abhinav Aggarwal, Shiva Prasad Kasiviswanathan, Zekun Xu et al.

Log-loss (also known as cross-entropy loss) metric is ubiquitously used across machine learning applications to assess the performance of classification algorithms. In this paper, we investigate the problem of inferring the labels of a dataset from single (or multiple) log-loss score(s), without any other access to the dataset. Surprisingly, we show that for any finite number of label classes, it is possible to accurately infer the labels of the dataset from the reported log-loss score of a single carefully constructed prediction vector if we allow arbitrary precision arithmetic. Additionally, we present label inference algorithms (attacks) that succeed even under addition of noise to the log-loss scores and under limited precision arithmetic. All our algorithms rely on ideas from number theory and combinatorics and require no model training. We run experimental simulations on some real datasets to demonstrate the ease of running these attacks in practice.

CLApr 23, 2021
On a Utilitarian Approach to Privacy Preserving Text Generation

Zekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan et al.

Differentially-private mechanisms for text generation typically add carefully calibrated noise to input words and use the nearest neighbor to the noised input as the output word. When the noise is small in magnitude, these mechanisms are susceptible to reconstruction of the original sensitive text. This is because the nearest neighbor to the noised input is likely to be the original input. To mitigate this empirical privacy risk, we propose a novel class of differentially private mechanisms that parameterizes the nearest neighbor selection criterion in traditional mechanisms. Motivated by Vickrey auction, where only the second highest price is revealed and the highest price is kept private, we balance the choice between the first and the second nearest neighbors in the proposed class of mechanisms using a tuning parameter. This parameter is selected by empirically solving a constrained optimization problem for maximizing utility, while maintaining the desired privacy guarantees. We argue that this empirical measurement framework can be used to align different mechanisms along a common benchmark for their privacy-utility tradeoff, particularly when different distance metrics are used to calibrate the amount of noise added. Our experiments on real text classification datasets show up to 50% improvement in utility compared to the existing state-of-the-art with the same empirical privacy guarantee.

LGDec 10, 2020
Research Challenges in Designing Differentially Private Text Generation Mechanisms

Oluwaseyi Feyisetan, Abhinav Aggarwal, Zekun Xu et al.

Accurately learning from user data while ensuring quantifiable privacy guarantees provides an opportunity to build better Machine Learning (ML) models while maintaining user trust. Recent literature has demonstrated the applicability of a generalized form of Differential Privacy to provide guarantees over text queries. Such mechanisms add privacy preserving noise to vectorial representations of text in high dimension and return a text based projection of the noisy vectors. However, these mechanisms are sub-optimal in their trade-off between privacy and utility. This is due to factors such as a fixed global sensitivity which leads to too much noise added in dense spaces while simultaneously guaranteeing protection for sensitive outliers. In this proposal paper, we describe some challenges in balancing the tradeoff between privacy and utility for these differentially private text mechanisms. At a high level, we provide two proposals: (1) a framework called LAC which defers some of the noise to a privacy amplification step and (2), an additional suite of three different techniques for calibrating the noise based on the local region around a word. Our objective in this paper is not to evaluate a single solution but to further the conversation on these challenges and chart pathways for building better mechanisms.

CLOct 22, 2020
A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis Metric

Zekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan et al.

Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach for privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space, perturbed by sampling a spherical noise from an appropriate distribution, and then projected back to the discrete vocabulary space. While this allows the perturbation to admit the required metric differential privacy, often the utility of downstream tasks modeled on this perturbed data is low because the spherical noise does not account for the variability in the density around different words in the embedding space. In particular, words in a sparse region are likely unchanged even when the noise scale is large. %Using the global sensitivity of the mechanism can potentially add too much noise to the words in the dense regions of the embedding space, causing a high utility loss, whereas using local sensitivity can leak information through the scale of the noise added. In this paper, we propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem. For any given noise scale, this metric adds an elliptical noise to account for the covariance structure in the embedding space. This heterogeneity in the noise scale along different directions helps ensure that the words in the sparse region have sufficient likelihood of replacement without sacrificing the overall utility. We provide a text-perturbation algorithm based on this metric and formally prove its privacy guarantees. Additionally, we empirically show that our mechanism improves the privacy statistics to achieve the same level of utility as compared to the state-of-the-art Laplace mechanism.

LGSep 27, 2020
Differentially Private Adversarial Robustness Through Randomized Perturbations

Nan Xu, Oluwaseyi Feyisetan, Abhinav Aggarwal et al.

Deep Neural Networks, despite their great success in diverse domains, are provably sensitive to small perturbations on correctly classified examples and lead to erroneous predictions. Recently, it was proposed that this behavior can be combatted by optimizing the worst case loss function over all possible substitutions of training examples. However, this can be prone to weighing unlikely substitutions higher, limiting the accuracy gain. In this paper, we study adversarial robustness through randomized perturbations, which has two immediate advantages: (1) by ensuring that substitution likelihood is weighted by the proximity to the original word, we circumvent optimizing the worst case guarantees and achieve performance gains; and (2) the calibrated randomness imparts differentially-private model training, which additionally improves robustness against adversarial attacks on the model outputs. Our approach uses a novel density-based mechanism based on truncated Gumbel noise, which ensures training on substitutions of both rare and dense words in the vocabulary while maintaining semantic similarity for model robustness.

LGSep 17, 2020
On Primes, Log-Loss Scores and (No) Privacy

Abhinav Aggarwal, Zekun Xu, Oluwaseyi Feyisetan et al.

Membership Inference Attacks exploit the vulnerabilities of exposing models trained on customer data to queries by an adversary. In a recently proposed implementation of an auditing tool for measuring privacy leakage from sensitive datasets, more refined aggregates like the Log-Loss scores are exposed for simulating inference attacks as well as to assess the total privacy leakage based on the adversary's predictions. In this paper, we prove that this additional information enables the adversary to infer the membership of any number of datapoints with full accuracy in a single query, causing complete membership privacy breach. Our approach obviates any attack model training or access to side knowledge with the adversary. Moreover, our algorithms are agnostic to the model under attack and hence, enable perfect membership inference even for models that do not memorize or overfit. In particular, our observations provide insight into the extent of information leakage from statistical aggregates and how they can be exploited.

APFeb 1, 2019
Automatic Hyperparameter Tuning Method for Local Outlier Factor, with Applications to Anomaly Detection

Zekun Xu, Deovrat Kakde, Arin Chaudhuri

In recent years, there have been many practical applications of anomaly detection such as in predictive maintenance, detection of credit fraud, network intrusion, and system failure. The goal of anomaly detection is to identify in the test data anomalous behaviors that are either rare or unseen in the training data. This is a common goal in predictive maintenance, which aims to forecast the imminent faults of an appliance given abundant samples of normal behaviors. Local outlier factor (LOF) is one of the state-of-the-art models used for anomaly detection, but the predictive performance of LOF depends greatly on the selection of hyperparameters. In this paper, we propose a novel, heuristic methodology to tune the hyperparameters in LOF. A tuned LOF model that uses the proposed method shows good predictive performance in both simulations and real data sets.