Xiaoyuan Liu

CL
h-index34
33papers
1,574citations
Novelty53%
AI Score61

33 Papers

LGJul 21, 2022Code
UniFed: All-In-One Federated Learning Platform to Unify Open-Source Frameworks

Xiaoyuan Liu, Tianneng Shi, Chulin Xie et al.

Federated Learning (FL) has become a practical and widely adopted distributed learning paradigm. However, the lack of a comprehensive and standardized solution covering diverse use cases makes it challenging to use in practice. In addition, selecting an appropriate FL framework for a specific use case can be a daunting task. In this work, we present UniFed, the first unified platform for standardizing existing open-source FL frameworks. The platform streamlines the end-to-end workflow for distributed experimentation and deployment, encompassing 11 popular open-source FL frameworks. In particular, to address the substantial variations in workflows and data formats, UniFed introduces a configuration-based schema-enforced task specification, offering 20 editable fields. UniFed also provides functionalities such as distributed execution management, logging, and data analysis. With UniFed, we evaluate and compare 11 popular FL frameworks from the perspectives of functionality, privacy protection, and performance, through conducting developer surveys and code-level investigation. We collect 15 diverse FL scenario setups (e.g., horizontal and vertical settings) for FL framework evaluation. This comprehensive evaluation allows us to analyze both model and system performance, providing detailed comparisons and offering recommendations for framework selection. UniFed simplifies the process of selecting and utilizing the appropriate FL framework for specific use cases, while enabling standardized distributed experimentation and deployment. Our results and analysis based on experiments with up to 178 distributed nodes provide valuable system design and deployment insights, aiming to empower practitioners in their pursuit of effective FL solutions.

96.9AIJun 3
Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang et al.

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

QUANT-PHAug 23, 2024Code
QAdaPrune: Adaptive Parameter Pruning For Training Variational Quantum Circuits

Ankit Kulshrestha, Xiaoyuan Liu, Hayato Ushijima-Mwesigwa et al.

In the present noisy intermediate scale quantum computing era, there is a critical need to devise methods for the efficient implementation of gate-based variational quantum circuits. This ensures that a range of proposed applications can be deployed on real quantum hardware. The efficiency of quantum circuit is desired both in the number of trainable gates and the depth of the overall circuit. The major concern of barren plateaus has made this need for efficiency even more acute. The problem of efficient quantum circuit realization has been extensively studied in the literature to reduce gate complexity and circuit depth. Another important approach is to design a method to reduce the \emph{parameter complexity} in a variational quantum circuit. Existing methods include hyperparameter-based parameter pruning which introduces an additional challenge of finding the best hyperparameters for different applications. In this paper, we present \emph{QAdaPrune} - an adaptive parameter pruning algorithm that automatically determines the threshold and then intelligently prunes the redundant and non-performing parameters. We show that the resulting sparse parameter sets yield quantum circuits that perform comparably to the unpruned quantum circuits and in some cases may enhance trainability of the circuits even if the original quantum circuit gets stuck in a barren plateau.\\ \noindent{\bf Reproducibility}: The source code and data are available at \url{https://github.com/aicaffeinelife/QAdaPrune.git}

LGOct 18, 2022
Towards Practical Explainability with Cluster Descriptors

Xiaoyuan Liu, Ilya Tyagin, Hayato Ushijima-Mwesigwa et al.

With the rapid development of machine learning, improving its explainability has become a crucial research goal. We study the problem of making the clusters more explainable by investigating the cluster descriptors. Given a set of objects $S$, a clustering of these objects $π$, and a set of tags $T$ that have not participated in the clustering algorithm. Each object in $S$ is associated with a subset of $T$. The goal is to find a representative set of tags for each cluster, referred to as the cluster descriptors, with the constraint that these descriptors we find are pairwise disjoint, and the total size of all the descriptors is minimized. In general, this problem is NP-hard. We propose a novel explainability model that reinforces the previous models in such a way that tags that do not contribute to explainability and do not sufficiently distinguish between clusters are not added to the optimal descriptors. The proposed model is formulated as a quadratic unconstrained binary optimization problem which makes it suitable for solving on modern optimization hardware accelerators. We experimentally demonstrate how a proposed explainability model can be solved on specialized hardware for accelerating combinatorial optimization, the Fujitsu Digital Annealer, and use real-life Twitter and PubMed datasets for use cases.

QUANT-PHApr 15, 2023
Learning To Optimize Quantum Neural Network Without Gradients

Ankit Kulshrestha, Xiaoyuan Liu, Hayato Ushijima-Mwesigwa et al.

Quantum Machine Learning is an emerging sub-field in machine learning where one of the goals is to perform pattern recognition tasks by encoding data into quantum states. This extension from classical to quantum domain has been made possible due to the development of hybrid quantum-classical algorithms that allow a parameterized quantum circuit to be optimized using gradient based algorithms that run on a classical computer. The similarities in training of these hybrid algorithms and classical neural networks has further led to the development of Quantum Neural Networks (QNNs). However, in the current training regime for QNNs, the gradients w.r.t objective function have to be computed on the quantum device. This computation is highly non-scalable and is affected by hardware and sampling noise present in the current generation of quantum hardware. In this paper, we propose a training algorithm that does not rely on gradient information. Specifically, we introduce a novel meta-optimization algorithm that trains a \emph{meta-optimizer} network to output parameters for the quantum circuit such that the objective function is minimized. We empirically and theoretically show that we achieve a better quality minima in fewer circuit evaluations than existing gradient based algorithms on different datasets.

LGOct 18, 2023
Effective and Efficient Federated Tree Learning on Hybrid Data

Qinbin Li, Chulin Xie, Xiaojun Xu et al.

Federated learning has emerged as a promising distributed learning paradigm that facilitates collaborative learning among multiple parties without transferring raw data. However, most existing federated learning studies focus on either horizontal or vertical data settings, where the data of different parties are assumed to be from the same feature or sample space. In practice, a common scenario is the hybrid data setting, where data from different parties may differ both in the features and samples. To address this, we propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data. We observe the existence of consistent split rules in trees. With the help of these split rules, we theoretically show that the knowledge of parties can be incorporated into the lower layers of a tree. Based on our theoretical analysis, we propose a layer-level solution that does not need frequent communication traffic to train a tree. Our experiments demonstrate that HybridTree can achieve comparable accuracy to the centralized setting with low computational and communication overhead. HybridTree can achieve up to 8 times speedup compared with the other baselines.

SDJul 15, 2024
DDFAD: Dataset Distillation Framework for Audio Data

Wenbo Jiang, Rui Zhang, Hongwei Li et al.

Deep neural networks (DNNs) have achieved significant success in numerous applications. The remarkable performance of DNNs is largely attributed to the availability of massive, high-quality training datasets. However, processing such massive training data requires huge computational and storage resources. Dataset distillation is a promising solution to this problem, offering the capability to compress a large dataset into a smaller distilled dataset. The model trained on the distilled dataset can achieve comparable performance to the model trained on the whole dataset. While dataset distillation has been demonstrated in image data, none have explored dataset distillation for audio data. In this work, for the first time, we propose a Dataset Distillation Framework for Audio Data (DDFAD). Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as extracted features for audio data. After that, the FD-MFCC is distilled through the matching training trajectory distillation method. Finally, we propose an audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments demonstrate the effectiveness of DDFAD on various audio datasets. In addition, we show that DDFAD has promising application prospects in many applications, such as continual learning and neural architecture search.

CVFeb 16, 2025Code
AI Sees Your Location, But With A Bias Toward The Wealthy World

Jingyuan Huang, Jen-tse Huang, Ziyi Liu et al. · pku, tencent-ai

Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, regional biases of frequently over-predicting certain locations remain. For instance, they consistently predict Sydney for images taken in Australia, shown by the low entropy scores for these countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.

CLJan 7, 2025Code
Can LLMs Ask Good Questions?

Yueheng Zhang, Xiaoyuan Liu, Yiyou Sun et al.

We evaluate questions generated by large language models (LLMs) from context, comparing them to human-authored questions across six dimensions: question type, question length, context coverage, answerability, uncommonness, and required answer length. Our study spans two open-source and two proprietary state-of-the-art models. Results reveal that LLM-generated questions tend to demand longer descriptive answers and exhibit more evenly distributed context focus, in contrast to the positional bias often seen in QA tasks. These findings provide insights into the distinctive characteristics of LLM-generated questions and inform future work on question quality and downstream applications.

CVFeb 23, 2025Code
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang et al.

Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing spatial relationships. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. These subtests span four core domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that current MLLM performance gains on high-level benchmarks do not reflect human-like low-level visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities. The dataset and evaluation toolkit are publicly available at: https://github.com/CUHK-ARISE/VisFactor.

LGOct 15, 2023
Model Inversion Attacks on Homogeneous and Heterogeneous Graph Neural Networks

Renyang Liu, Wei Zhou, Jinhong Zhang et al.

Recently, Graph Neural Networks (GNNs), including Homogeneous Graph Neural Networks (HomoGNNs) and Heterogeneous Graph Neural Networks (HeteGNNs), have made remarkable progress in many physical scenarios, especially in communication applications. Despite achieving great success, the privacy issue of such models has also received considerable attention. Previous studies have shown that given a well-fitted target GNN, the attacker can reconstruct the sensitive training graph of this model via model inversion attacks, leading to significant privacy worries for the AI service provider. We advocate that the vulnerability comes from the target GNN itself and the prior knowledge about the shared properties in real-world graphs. Inspired by this, we propose a novel model inversion attack method on HomoGNNs and HeteGNNs, namely HomoGMI and HeteGMI. Specifically, HomoGMI and HeteGMI are gradient-descent-based optimization methods that aim to maximize the cross-entropy loss on the target GNN and the $1^{st}$ and $2^{nd}$-order proximities on the reconstructed graph. Notably, to the best of our knowledge, HeteGMI is the first attempt to perform model inversion attacks on HeteGNNs. Extensive experiments on multiple benchmarks demonstrate that the proposed method can achieve better performance than the competitors.

DCJan 4, 2025Code
DeServe: Towards Affordable Offline LLM Inference via Decentralization

Linyu Wu, Xiaoyuan Liu, Tianneng Shi et al.

The rapid growth of generative AI and its integration into everyday workflows have significantly increased the demand for large language model (LLM) inference services. While proprietary models remain popular, recent advancements in open-source LLMs have positioned them as strong contenders. However, deploying these models is often constrained by the high costs and limited availability of GPU resources. In response, this paper presents the design of a decentralized offline serving system for LLM inference. Utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost. DeServe specifically addresses key challenges in optimizing serving throughput in high-latency network environments. Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such conditions.

54.4QUANT-PHMay 13
CO-MAP: A Reinforcement Learning Approach to the Qubit Allocation Problem

Ankit Kulshrestha, Xiaoyuan Liu

A quantum compiler is a critical piece in the quantum computing pipeline since it allows an abstract quantum circuit to be run on a physical quantum computer. One extremely important subproblem in quantum compilation is the generation of a logical to physical qubit mapping. Typically in quantum compilers this step is either implemented as a random or a heuristic based assignment that aims to minimize additional (SWAP) gate overhead in the quantum circuit. In this paper, we present an alternative approach to solving the qubit mapping problem. Specifically, we formulate the qubit mapping problem with a combinatorial optimization (CO) objective. We then present a method to find a solution to the CO problem by training a reinforcement learning (RL) policy. We also propose a local search based post-processing algorithm to further reduce the overhead. Our results show a dramatic improvement over conventional techniques in reducing the number of SWAPs. On different real world datasets like MQTBench and Queko circuits, our trained policy achieves a \textbf{65-85\%} reduction in SWAP overhead when compared to existing quantum compilers.

68.4QUANT-PHMay 12
QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

Kien X. Nguyen, Ankit Kulshrestha, Ilya Safro et al.

Qubit routing is a fundamental problem in quantum compilation, known to be NP-hard. Its dynamic nature makes local routing decisions propagate and compound over time, making global efficient solutions challenging. Existing heuristic methods rely on local rules with limited lookahead, while recent learning-based approaches often treat routing as a generic sequential decision problem without fully exploiting its underlying structure. In this paper, we introduce QAP-Router, framing qubit routing based on a dynamic Quadratic Assignment Problem (QAP) formulation. By modeling logical interactions, or quantum gates, as flow matrices and hardware topology as a distance matrix, our approach captures the interaction-distance coupling in a unified objective, which defines the reward in the reinforcement learning environment. To further exploit this structure, the policy network employs a solution-aware Transformer backbone that encodes the interaction between the flow matrix and the distance matrix into the attention mechanism. We also integrate a lookahead mechanism that blends naturally into the QAP framework, preventing myopic decisions. Extensive experiments on 1,831 real-world quantum circuits from the MQTBench, AgentQ and QUEKO datasets show that our method substantially reduces the CNOT gate count of routed circuits by 15.7%, 30.4% and 12.1%, respectively, relative to existing industry compilers.

CRJan 19, 2021Code
A System for Automated Open-Source Threat Intelligence Gathering and Management

Peng Gao, Xiaoyuan Liu, Edward Choi et al.

To remain aware of the fast-evolving cyber threat landscape, open-source Cyber Threat Intelligence (OSCTI) has received growing attention from the community. Commonly, knowledge about threats is presented in a vast number of OSCTI reports. Despite the pressing need for high-quality OSCTI, existing OSCTI gathering and management platforms, however, have primarily focused on isolated, low-level Indicators of Compromise. On the other hand, higher-level concepts (e.g., adversary tactics, techniques, and procedures) and their relationships have been overlooked, which contain essential knowledge about threat behaviors that is critical to uncovering the complete threat scenario. To bridge the gap, we propose SecurityKG, a system for automated OSCTI gathering and management. SecurityKG collects OSCTI reports from various sources, uses a combination of AI and NLP techniques to extract high-fidelity knowledge about threat behaviors, and constructs a security knowledge graph. SecurityKG also provides a UI that supports various types of interactivity to facilitate knowledge graph exploration.

CRJan 17, 2021Code
A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat Intelligence

Peng Gao, Fei Shao, Xiaoyuan Liu et al.

Log-based cyber threat hunting has emerged as an important solution to counter sophisticated cyber attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external knowledge about threat behaviors provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we build ThreatRaptor, a system that facilitates cyber threat hunting in computer systems using OSCTI. Built upon mature system auditing frameworks, ThreatRaptor provides (1) an unsupervised, light-weight, and accurate NLP pipeline that extracts structured threat behaviors from unstructured OSCTI text, (2) a concise and expressive domain-specific query language, TBQL, to hunt for malicious system activities, (3) a query synthesis mechanism that automatically synthesizes a TBQL query from the extracted threat behaviors, and (4) an efficient query execution engine to search the big system audit logging data.

CROct 26, 2020Code
Enabling Efficient Cyber Threat Hunting With Cyber Threat Intelligence

Peng Gao, Fei Shao, Xiaoyuan Liu et al.

Log-based cyber threat hunting has emerged as an important solution to counter sophisticated attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external threat knowledge provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we propose ThreatRaptor, a system that facilitates threat hunting in computer systems using OSCTI. Built upon system auditing frameworks, ThreatRaptor provides (1) an unsupervised, light-weight, and accurate NLP pipeline that extracts structured threat behaviors from unstructured OSCTI text, (2) a concise and expressive domain-specific query language, TBQL, to hunt for malicious system activities, (3) a query synthesis mechanism that automatically synthesizes a TBQL query for hunting, and (4) an efficient query execution engine to search the big audit logging data. Evaluations on a broad set of attack cases demonstrate the accuracy and efficiency of ThreatRaptor in practical threat hunting.

85.6CRMar 11
The Attack and Defense Landscape of Agentic AI: A Comprehensive Survey

Juhee Kim, Xiaoyuan Liu, Zhun Wang et al.

AI agents that combine large language models with non-AI system components are rapidly emerging in real-world applications, offering unprecedented automation and flexibility. However, this unprecedented flexibility introduces complex security challenges fundamentally different from those in traditional software systems. This paper presents the first systematic and comprehensive survey of AI agent security, including an analysis of the design space, attack landscape, and defense mechanisms for secure AI agent systems. We further conduct case studies to point out existing gaps in securing agentic AI systems and identify open challenges in this emerging domain. Our work also introduces the first systematic framework for understanding the security risks and defense strategies of AI agents, serving as a foundation for building both secure agentic systems and advancing research in this critical area.

CLMar 4, 2025
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

Ke Ji, Jiahao Xu, Tian Liang et al.

Improving the reasoning capabilities of large language models (LLMs) typically requires supervised fine-tuning with labeled data or computationally expensive sampling. We introduce Unsupervised Prefix Fine-Tuning (UPFT), which leverages the observation of Prefix Self-Consistency -- the shared initial reasoning steps across diverse solution trajectories -- to enhance LLM reasoning efficiency. By training exclusively on the initial prefix substrings (as few as 8 tokens), UPFT removes the need for labeled data or exhaustive sampling. Experiments on reasoning benchmarks show that UPFT matches the performance of supervised methods such as Rejection Sampling Fine-Tuning, while reducing training time by 75% and sampling cost by 99%. Further analysis reveals that errors tend to appear in later stages of the reasoning process and that prefix-based training preserves the model's structural knowledge. This work demonstrates how minimal unsupervised fine-tuning can unlock substantial reasoning gains in LLMs, offering a scalable and resource-efficient alternative to conventional approaches.

AIFeb 12
The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context

Xiaoyuan Liu, Tian Liang, Dongyang Ma et al.

In the world of Harry Potter, when Dumbledore's mind is overburdened, he extracts memories into a Pensieve to be revisited later. In the world of AI, while we possess the Pensieve-mature databases and retrieval systems, our models inexplicably lack the "wand" to operate it. They remain like a Dumbledore without agency, passively accepting a manually engineered context as their entire memory. This work finally places the wand in the model's hand. We introduce StateLM, a new class of foundation models endowed with an internal reasoning loop to manage their own state. We equip our model with a suite of memory tools, such as context pruning, document indexing, and note-taking, and train it to actively manage these tools. By learning to dynamically engineering its own context, our model breaks free from the architectural prison of a fixed window. Experiments across various model sizes demonstrate StateLM's effectiveness across diverse scenarios. On long-document QA tasks, StateLMs consistently outperform standard LLMs across all model scales; on the chat memory task, they achieve absolute accuracy improvements of 10% to 20% over standard LLMs. On the deep research task BrowseComp-Plus, the performance gap becomes even more pronounced: StateLM achieves up to 52% accuracy, whereas standard LLM counterparts struggle around 5%. Ultimately, our approach shifts LLMs from passive predictors to state-aware agents where reasoning becomes a stateful and manageable process.

AIMay 19, 2025
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Xiaoyuan Liu, Tian Liang, Zhiwei He et al.

Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is ``superficial self-reflection'', where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.

CLFeb 16, 2025
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs

Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao et al. · pku, tencent-ai

Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe-a capability we term safety awareness. In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1500 carefully curated image-prompt pairs. MMSafeAware includes both unsafe and over-safety subsets to assess models abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness. Evaluating nine widely used MLLMs using MMSafeAware reveals that current models are not sufficiently safe and often overly sensitive; for example, GPT-4V misclassifies 36.1% of unsafe inputs as safe and 59.9% of benign inputs as unsafe. We further explore three methods to improve safety awareness-prompting-based approaches, visual contrastive decoding, and vision-centric reasoning fine-tuning-but find that none achieve satisfactory performance. Our findings highlight the profound challenges in developing MLLMs with robust safety awareness, underscoring the need for further research in this area. All the code and data will be publicly available to facilitate future research.

QUANT-PHApr 29, 2025
QAOA Parameter Transferability for Maximum Independent Set using Graph Attention Networks

Hanjing Xu, Xiaoyuan Liu, Alex Pothen et al.

The quantum approximate optimization algorithm (QAOA) is one of the promising variational approaches of quantum computing to solve combinatorial optimization problems. In QAOA, variational parameters need to be optimized by solving a series of nonlinear, nonconvex optimization programs. In this work, we propose a QAOA parameter transfer scheme using Graph Attention Networks (GAT) to solve Maximum Independent Set (MIS) problems. We prepare optimized parameters for graphs of 12 and 14 vertices and use GATs to transfer their parameters to larger graphs. Additionally, we design a hybrid distributed resource-aware algorithm for MIS (HyDRA-MIS), which decomposes large problems into smaller ones that can fit onto noisy intermediate-scale quantum (NISQ) computers. We integrate our GAT-based parameter transfer approach to HyDRA-MIS and demonstrate competitive results compared to KaMIS, a state-of-the-art classical MIS solver, on graphs with several thousands vertices.

CVMar 24, 2024
Predicate Debiasing in Vision-Language Models Integration for Scene Graph Generation Enhancement

Yuxuan Wang, Xiaoyuan Liu

Scene Graph Generation (SGG) provides basic language representation of visual scenes, requiring models to grasp complex and diverse semantics between objects. This complexity and diversity in SGG leads to underrepresentation, where parts of triplet labels are rare or even unseen during training, resulting in imprecise predictions. To tackle this, we propose integrating the pretrained Vision-language Models to enhance representation. However, due to the gap between pretraining and SGG, direct inference of pretrained VLMs on SGG leads to severe bias, which stems from the imbalanced predicates distribution in the pretraining language set. To alleviate the bias, we introduce a novel LM Estimation to approximate the unattainable predicates distribution. Finally, we ensemble the debiased VLMs with SGG models to enhance the representation, where we design a certainty-aware indicator to score each sample and dynamically adjust the ensemble weights. Our training-free method effectively addresses the predicates bias in pretrained VLMs, enhances SGG's representation, and significantly improve the performance.

CLOct 26, 2025
Scalable Supervising Software Agents with Patch Reasoner

Junjielong Xu, Boyin Tan, Xiaoyuan Liu et al.

While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running test sandbox is rather heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modification and gain a dense reward for stable training. R4P achieves 72.2% Acc. for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P's practicality.

CLOct 9, 2025
Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Youliang Yuan, Qiuyang Mang, Jingbang Chen et al. · pku, tencent-ai

Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.

QUANT-PHSep 18, 2025
Neural Architecture Search Algorithms for Quantum Autoencoders

Ankit Kulshrestha, Xiaoyuan Liu, Hayato Ushijima-Mwesigwa et al.

The design of quantum circuits is currently driven by the specific objectives of the quantum algorithm in question. This approach thus relies on a significant manual effort by the quantum algorithm designer to design an appropriate circuit for the task. However this approach cannot scale to more complex quantum algorithms in the future without exponentially increasing the circuit design effort and introducing unwanted inductive biases. Motivated by this observation, we propose to automate the process of cicuit design by drawing inspiration from Neural Architecture Search (NAS). In this work, we propose two Quantum-NAS algorithms that aim to find efficient circuits given a particular quantum task. We choose quantum data compression as our driver quantum task and demonstrate the performance of our algorithms by finding efficient autoencoder designs that outperform baselines on three different tasks - quantum data denoising, classical data compression and pure quantum data compression. Our results indicate that quantum NAS algorithms can significantly alleviate the manual effort while delivering performant quantum circuits for any given task.

CVAug 5, 2025
MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy

Wuyang Li, Wentao Pan, Xiaoyuan Liu et al.

Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, where the physical constraints with millimetre-scale thickness impose serious impediments on the micro-level clinical. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical difference of metalens, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing the novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from the foundational model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalized ability in real biomedical scenes.

CVJun 12, 2025
Defensive Adversarial CAPTCHA: A Semantics-Driven Framework for Natural Adversarial Example Generation

Xia Du, Xiaoyuan Liu, Jizhe Zhou et al.

Traditional CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) schemes are increasingly vulnerable to automated attacks powered by deep neural networks (DNNs). Existing adversarial attack methods often rely on the original image characteristics, resulting in distortions that hinder human interpretation and limit their applicability in scenarios where no initial input images are available. To address these challenges, we propose the Unsourced Adversarial CAPTCHA (DAC), a novel framework that generates high-fidelity adversarial examples guided by attacker-specified semantics information. Leveraging a Large Language Model (LLM), DAC enhances CAPTCHA diversity and enriches the semantic information. To address various application scenarios, we examine the white-box targeted attack scenario and the black box untargeted attack scenario. For target attacks, we introduce two latent noise variables that are alternately guided in the diffusion step to achieve robust inversion. The synergy between gradient guidance and latent variable optimization achieved in this way ensures that the generated adversarial examples not only accurately align with the target conditions but also achieve optimal performance in terms of distributional consistency and attack effectiveness. In untargeted attacks, especially for black-box scenarios, we introduce bi-path unsourced adversarial CAPTCHA (BP-DAC), a two-step optimization strategy employing multimodal gradients and bi-path optimization for efficient misclassification. Experiments show that the defensive adversarial CAPTCHA generated by BP-DAC is able to defend against most of the unknown models, and the generated CAPTCHA is indistinguishable to both humans and DNNs.

CLSep 22, 2020
Structured Hierarchical Dialogue Policy with Graph Neural Networks

Zhi Chen, Xiaoyuan Liu, Lu Chen et al.

Dialogue policy training for composite tasks, such as restaurant reservation in multiple places, is a practically important and challenging problem. Recently, hierarchical deep reinforcement learning (HDRL) methods have achieved good performance in composite tasks. However, in vanilla HDRL, both top-level and low-level policies are all represented by multi-layer perceptrons (MLPs) which take the concatenation of all observations from the environment as the input for predicting actions. Thus, traditional HDRL approach often suffers from low sampling efficiency and poor transferability. In this paper, we address these problems by utilizing the flexibility of graph neural networks (GNNs). A novel ComNet is proposed to model the structure of a hierarchical agent. The performance of ComNet is tested on composited tasks of the PyDial benchmark. Experiments show that ComNet outperforms vanilla HDRL systems with performance close to the upper bound. It not only achieves sample efficiency but also is more robust to noise while maintaining the transferability to other composite tasks.

CLSep 22, 2020
Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management

Zhi Chen, Lu Chen, Xiaoyuan Liu et al.

The task-oriented spoken dialogue system (SDS) aims to assist a human user in accomplishing a specific task (e.g., hotel booking). The dialogue management is a core part of SDS. There are two main missions in dialogue management: dialogue belief state tracking (summarising conversation history) and dialogue decision-making (deciding how to reply to the user). In this work, we only focus on devising a policy that chooses which dialogue action to respond to the user. The sequential system decision-making process can be abstracted into a partially observable Markov decision process (POMDP). Under this framework, reinforcement learning approaches can be used for automated policy optimization. In the past few years, there are many deep reinforcement learning (DRL) algorithms, which use neural networks (NN) as function approximators, investigated for dialogue policy.

CRJul 5, 2020
BeeTrace: A Unified Platform for Secure Contact Tracing that Breaks Data Silos

Xiaoyuan Liu, Ni Trieu, Evgenios M. Kornaropoulos et al.

Contact tracing is an important method to control the spread of an infectious disease such as COVID-19. However, existing contact tracing methods alone cannot provide sufficient coverage and do not successfully address privacy concerns of the participating entities. Current solutions do not utilize the huge volume of data stored in business databases and individual digital devices. This information is typically stored in data silos and cannot be used due to regulations in place. To successfully unlock the potential of contact tracing, we need to consider both data utilization from multiple sources and the privacy of the participating parties. To this end, we propose BeeTrace, a unified platform that breaks data silos and deploys state-of-the-art cryptographic protocols to guarantee privacy goals.

CLApr 13, 2020
Pretrained Transformers Improve Out-of-Distribution Robustness

Dan Hendrycks, Xiaoyuan Liu, Eric Wallace et al.

Although pretrained Transformers such as BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? We systematically measure out-of-distribution (OOD) generalization for seven NLP datasets by constructing a new robustness benchmark with realistic distribution shifts. We measure the generalization of previous models including bag-of-words models, ConvNets, and LSTMs, and we show that pretrained Transformers' performance declines are substantially smaller. Pretrained transformers are also more effective at detecting anomalous or OOD examples, while many previous models are frequently worse than chance. We examine which factors affect robustness, finding that larger models are not necessarily more robust, distillation can be harmful, and more diverse pretraining data can enhance robustness. Finally, we show where future work can improve OOD robustness.