CLMay 28
EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended GenerationXin Guan, Xiaomeng Hu, Shen Huang et al.
Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
SPDec 23, 2022
Simple Yet Surprisingly Effective Training Strategies for LSTMs in Sensor-Based Human Activity RecognitionShuai Shao, Yu Guan, Xin Guan et al.
Human Activity Recognition (HAR) is one of the core research areas in mobile and wearable computing. With the application of deep learning (DL) techniques such as CNN, recognizing periodic or static activities (e.g, walking, lying, cycling, etc.) has become a well studied problem. What remains a major challenge though is the sporadic activity recognition (SAR) problem, where activities of interest tend to be non periodic, and occur less frequently when compared with the often large amount of irrelevant background activities. Recent works suggested that sequential DL models (such as LSTMs) have great potential for modeling nonperiodic behaviours, and in this paper we studied some LSTM training strategies for SAR. Specifically, we proposed two simple yet effective LSTM variants, namely delay model and inverse model, for two SAR scenarios (with and without time critical requirement). For time critical SAR, the delay model can effectively exploit predefined delay intervals (within tolerance) in form of contextual information for improved performance. For regular SAR task, the second proposed, inverse model can learn patterns from the time series in an inverse manner, which can be complementary to the forward model (i.e.,LSTM), and combining both can boost the performance. These two LSTM variants are very practical, and they can be deemed as training strategies without alteration of the LSTM fundamentals. We also studied some additional LSTM training strategies, which can further improve the accuracy. We evaluated our models on two SAR and one non-SAR datasets, and the promising results demonstrated the effectiveness of our approaches in HAR applications.
IRAug 29, 2024
HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy ApplicationsRishi Kalra, Zekun Wu, Ayesha Gulley et al.
Large Language Models (LLMs) face limitations in AI legal and policy applications due to outdated knowledge, hallucinations, and poor reasoning in complex contexts. Retrieval-Augmented Generation (RAG) systems address these issues by incorporating external knowledge, but suffer from retrieval errors, ineffective context integration, and high operational costs. This paper presents the Hybrid Parameter-Adaptive RAG (HyPA-RAG) system, designed for the AI legal domain, with NYC Local Law 144 (LL144) as the test case. HyPA-RAG integrates a query complexity classifier for adaptive parameter tuning, a hybrid retrieval approach combining dense, sparse, and knowledge graph methods, and a comprehensive evaluation framework with tailored question types and metrics. Testing on LL144 demonstrates that HyPA-RAG enhances retrieval accuracy, response fidelity, and contextual precision, offering a robust and adaptable solution for high-stakes legal and policy applications.
CLSep 17, 2024
SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness CalibrationXin Guan, Ze Wang, Nathaniel Demchak et al.
The development of unbiased large language models is widely recognized as crucial, yet existing benchmarks fall short in detecting biases due to limited scope, contamination, and lack of a fairness baseline. SAGED(bias) is the first holistic benchmarking pipeline to address these problems. The pipeline encompasses five core stages: scraping materials, assembling benchmarks, generating responses, extracting numeric features, and diagnosing with disparity metrics. SAGED includes metrics for max disparity, such as impact ratio, and bias concentration, such as Max Z-scores. Noticing that metric tool bias and contextual bias in prompts can distort evaluation, SAGED implements counterfactual branching and baseline calibration for mitigation. For demonstration, we use SAGED on G20 Countries with popular 8b-level models including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we find that while Mistral and Qwen2 show lower max disparity and higher bias concentration than Gemma2 and Llama3.1, all models are notably biased against countries like Russia and (except for Qwen2) China. With further experiments to have models role-playing U.S. presidents, we see bias amplifies and shifts in heterogeneous directions. Moreover, we see Qwen2 and Mistral not engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more intensively than Biden and Harris, indicating role-playing performance bias in these models.
CLSep 16, 2024
From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMsNavya Jain, Zekun Wu, Cristian Munoz et al.
The manipulation of the personality traits of large language models (LLMs) has emerged as a key area of research. Methods like prompt-based In-Context Knowledge Editing (IKE) and gradient-based Model Editor Networks (MEND) have been explored but show irregularity and variability; IKE depends on the prompt, leading to variability and sensitivity, while MEND yields inconsistent and gibberish outputs. To address this, we employed Opinion QA Based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLoRA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and LLaMA-2-7B-chat showed a latent behaviour by generating emojis for certain traits, despite no emojis being present in the PEFT data. For instance, LLaMA-2-7B-chat generated emojis in 99.5\% of extraversion-related test instances, while Mistral-7B-Instruct did so in 92.5\% of openness-related test instances. ICL Explainability analysis indicated that the LLMs used emojis intentionally to express these traits. Mechanistic Interpretability analysis showed that this latent behaviour of LLMs could be traced to specific neurons that became activated or amplified after PEFT. This paper provides a number of novel contributions. First, introducing an Opinion QA dataset for PEFT-driven personality manipulation; second, developing metric models to benchmark LLM personality traits; third, demonstrating PEFT's superiority over IKE in personality manipulation; and finally, analysing and validating emoji usage through explainability methods such as Mechanistic Interpretability and In-context learning Explainability methods.
AIJan 15
Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context ReasoningXin Guan, Zijian Li, Shen Huang et al.
While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
CRMay 13, 2025Code
LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI LibrariesZekun Wu, Seonglae Cho, Umar Mohammed et al.
Open-source AI libraries are foundational to modern AI systems, yet they present significant, underexamined risks spanning security, licensing, maintenance, supply chain integrity, and regulatory compliance. We introduce LibVulnWatch, a system that leverages recent advances in large language models and agentic workflows to perform deep, evidence-based evaluations of these libraries. Built on a graph-based orchestration of specialized agents, the framework extracts, verifies, and quantifies risk using information from repositories, documentation, and vulnerability databases. LibVulnWatch produces reproducible, governance-aligned scores across five critical domains, publishing results to a public leaderboard for ongoing ecosystem monitoring. Applied to 20 widely used libraries, including ML frameworks, LLM inference engines, and agent orchestration tools, our approach covers up to 88% of OpenSSF Scorecard checks while surfacing up to 19 additional risks per library, such as critical RCE vulnerabilities, missing SBOMs, and regulatory gaps. By integrating advanced language technologies with the practical demands of software risk assessment, this work demonstrates a scalable, transparent mechanism for continuous supply chain evaluation and informed library selection.
SEMar 13
NormCode Canvas: Making LLM Agentic Workflows Development Sustainable via Case-Based ReasoningXin Guan, Yunshan Li, Ze Wang
We present NormCode Canvas (v1.1.3), a deployed system realizing Case-Based Reasoning at two levels for multi-step LLM workflows. The foundation is NormCode, a semi-formal planning language whose compiler-verified scope rule ensures every execution checkpoint is a genuinely self-contained case -- eliminating the implicit shared state that makes retrieval unreliable and failure non-localizable in standard orchestration frameworks. Level 1 treats each checkpoint as a concrete case (suspended runtime); Fork implements retrieve-and-reuse, Value Override implements revision with automatic stale-boundary propagation. Level 2 treats each compiled plan as an abstract case; the compilation pipeline is itself a NormCode plan, enabling recursive case learning. Three structural properties follow: (C1) direct checkpoint inspection; (C2) pre-execution review via compiler-generated narrative; (C3) scope-bounded selective re-execution. Four deployed plans serve as structured evidence: PPT Generation produces presentation decks at ~40s per slide on commercial APIs; Code Assistant carries out multi-step software-engineering tasks spanning up to ten reasoning cycles; NC Compilations converts natural-language specifications into executable NormCode plans; and Canvas Assistant, when connected to an external AI code editor, automates plan debugging. Together these plans form a self-sustaining ecosystem in which plans produce, debug, and refine one another -- realizing cumulative case-based learning at system scale.
CLOct 28, 2025Code
Tongyi DeepResearch Technical ReportTongyi DeepResearch Team, Baixuan Li, Bo Zhang et al.
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
CLSep 16, 2025
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep ResearchZijian Li, Xin Guan, Bo Zhang et al.
This paper tackles \textbf{open-ended deep research (OEDR)}, a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and monolithic generation paradigms that include redundant, irrelevant evidence, suffering from hallucination issues and low citation accuracy. To address these challenges, we introduce \textbf{WebWeaver}, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, citation-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank via citations for each part, it effectively mitigates long-context issues and citation hallucinations. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing comprehensive, trusted, and well-structured reports.
AIOct 19, 2024
Bias Amplification: Large Language Models as Increasingly Biased MediaZe Wang, Zekun Wu, Jeremy Zhang et al.
Model collapse, a phenomenon characterized by performance degradation due to iterative training on synthetic data, has been widely studied. However, its implications for bias amplification, the progressive intensification of pre-existing societal biases in Large Language Models (LLMs), remain significantly underexplored, despite the growing influence of LLMs in shaping online discourse. In this paper, we introduce a open, generational, and long-context benchmark specifically designed to measure political bias amplification in LLMs, leveraging sentence continuation tasks derived from a comprehensive dataset of U.S. political news. Our empirical study using GPT-2 reveals consistent and substantial political bias intensification (e.g., right-leaning amplification) over iterative synthetic training cycles. We evaluate three mitigation strategies, Overfitting, Preservation, and Accumulation, and demonstrate that bias amplification persists independently of model collapse, even when the latter is effectively controlled. Furthermore, we propose a mechanistic analysis approach that identifies neurons correlated with specific phenomena during inference through regression and statistical tests. This analysis uncovers largely distinct neuron populations driving bias amplification and model collapse, underscoring fundamentally different underlying mechanisms. Finally, we supplement our empirical findings with theoretical intuition that explains the separate origins of these phenomena, guiding targeted strategies for bias mitigation.
CLOct 14, 2024
Assessing Bias in Metric Models for LLM Open-Ended Generation Bias BenchmarksNathaniel Demchak, Xin Guan, Zekun Wu et al.
Open-generation bias benchmarks evaluate social biases in Large Language Models (LLMs) by analyzing their outputs. However, the classifiers used in analysis often have inherent biases, leading to unfair conclusions. This study examines such biases in open-generation benchmarks like BOLD and SAGED. Using the MGSD dataset, we conduct two experiments. The first uses counterfactuals to measure prediction variations across demographic groups by altering stereotype-related prefixes. The second applies explainability tools (SHAP) to validate that the observed biases stem from these counterfactuals. Results reveal unequal treatment of demographic descriptors, calling for more robust bias metric models.
AIDec 11, 2025
NormCode: A Semi-Formal Language for Auditable AI PlanningXin Guan, Yunshan Li, Zekun Wu et al.
As AI systems move into high stakes domains such as legal reasoning, medical diagnosis, and financial decision making, regulators and practitioners increasingly demand auditability. Auditability means the ability to trace exactly what each step in a multi step workflow saw and did. Current large language model based workflows are fundamentally opaque. Context pollution, defined as the accumulation of information across reasoning steps, causes models to hallucinate and lose track of constraints. At the same time, implicit data flow makes it impossible to reconstruct what any given step actually received as input. We present NormCode, a semi formal language that makes AI workflows auditable by construction. Each inference step operates in enforced data isolation and can access only explicitly passed inputs. This eliminates cross step contamination and ensures that every intermediate state can be inspected. A strict separation between semantic operations, meaning probabilistic language model reasoning, and syntactic operations, meaning deterministic data flow, allows auditors to clearly distinguish inference from mechanical restructuring. The multi format ecosystem, consisting of NCDS, NCD, NCN, and NCDN files, allows developers, domain experts, and auditors to inspect the same plan in formats suited to their individual needs. A four phase compilation pipeline transforms natural language intent into executable JSON repositories. A visual Canvas application provides real time graph visualization and breakpoint debugging. We validate the approach by achieving full accuracy on base X addition and by self hosted execution of the NormCode compiler itself. These results demonstrate that structured intermediate representations can bridge human intuition and machine rigor while maintaining full transparency.
ROMar 7
Vision-Guided MPPI for Agile Drone Racing: Navigating Arbitrary Gate Poses via Neural Signed Distance FieldsFangguo Zhao, Hanbing Zhang, Zhouheng Li et al.
Autonomous drone racing requires the tight coupling of perception, planning, and control under extreme agility. However, recent approaches typically rely on precomputed spatial reference trajectories or explicit 6-DoF gate pose estimation, rendering them brittle to spatial perturbations, unmodeled track changes, and sensor noise. Conversely, end-to-end learning policies frequently overfit to specific track layouts and struggle with zero-shot generalization. To address these fundamental limitations, we propose a fully onboard, vision guided optimal control framework that enables reference-free agile flight through arbitrarily placed and oriented gates. Central to our approach is Gate-SDF, a novel, implicitly learned neural signed distance field. Gate-SDF directly processes raw, noisy depth images to predict a continuous spatial field that provides both collision repulsion and active geometric guidance toward the valid traversal area. We seamlessly integrate this representation into a sampling-based Model Predictive Path Integral (MPPI) controller. By fully exploiting GPU parallelism, the framework evaluates these continuous spatial constraints across thousands of simulated trajectory rollouts simultaneously in real time. Furthermore, our formulation inherently maintains spatial consistency, ensuring robust navigation even under severe visual occlusion during aggressive maneuvers. Extensive simulations and real-world experiments demonstrate that the proposed system achieves high-speed agile flight and successfully navigates unseen tracks subject to severe unmodeled gate displacements and orientation perturbations. Videos are available at https://zhaofangguo.github.io/vision_guided_mppi/
CLJul 3, 2025
MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective FusionXin Guan, PeiHsin Lin, Zekun Wu et al.
Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top Univeristy), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.
CYFeb 21, 2025
AI Governance InternationaL Evaluation Index (AGILE Index) 2024Yi Zeng, Enmeng Lu, Xin Guan et al.
The rapid advancement of Artificial Intelligence (AI) technology is profoundly transforming human society and concurrently presenting a series of ethical, legal, and social issues. The effective governance of AI has become a crucial global concern. Since 2022, the extensive deployment of generative AI, particularly large language models, marked a new phase in AI governance. Continuous efforts are being made by the international community in actively addressing the novel challenges posed by these AI developments. As consensus on international governance continues to be established and put into action, the practical importance of conducting a global assessment of the state of AI governance is progressively coming to light. In this context, we initiated the development of the AI Governance InternationaL Evaluation Index (AGILE Index). Adhering to the design principle, "the level of governance should match the level of development," the inaugural evaluation of the AGILE Index commences with an exploration of four foundational pillars: the development level of AI, the AI governance environment, the AI governance instruments, and the AI governance effectiveness. It covers 39 indicators across 18 dimensions to comprehensively assess the AI governance level of 14 representative countries globally. The index is utilized to delve into the status of AI governance to date in 14 countries for the first batch of evaluation. The aim is to depict the current state of AI governance in these countries through data scoring, assist them in identifying their governance stage and uncovering governance issues, and ultimately offer insights for the enhancement of their AI governance systems.
CLJun 17, 2024
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language ModelsZe Wang, Zekun Wu, Xin Guan et al.
The use of Large Language Models (LLMs) in hiring has led to legislative actions to protect vulnerable demographic groups. This paper presents a novel framework for benchmarking hierarchical gender hiring bias in Large Language Models (LLMs) for resume scoring, revealing significant issues of reverse gender hiring bias and overdebiasing. Our contributions are fourfold: Firstly, we introduce a new construct grounded in labour economics, legal principles, and critiques of current bias benchmarks: hiring bias can be categorized into two types: Level bias (difference in the average outcomes between demographic counterfactual groups) and Spread bias (difference in the variance of outcomes between demographic counterfactual groups); Level bias can be further subdivided into statistical bias (i.e. changing with non-demographic content) and taste-based bias (i.e. consistent regardless of non-demographic content). Secondly, the framework includes rigorous statistical and computational hiring bias metrics, such as Rank After Scoring (RAS), Rank-based Impact Ratio, Permutation Test, and Fixed Effects Model. Thirdly, we analyze gender hiring biases in ten state-of-the-art LLMs. Seven out of ten LLMs show significant biases against males in at least one industry. An industry-effect regression reveals that the healthcare industry is the most biased against males. Moreover, we found that the bias performance remains invariant with resume content for eight out of ten LLMs. This indicates that the bias performance measured in this paper might apply to other resume datasets with different resume qualities. Fourthly, we provide a user-friendly demo and resume dataset to support the adoption and practical use of the framework, which can be generalized to other social traits and tasks.
CLMay 10, 2023
Multi-hop Commonsense Knowledge Injection Framework for Zero-Shot Commonsense Question AnsweringXin Guan, Biwei Cao, Qingqing Gao et al.
Commonsense question answering (QA) research requires machines to answer questions based on commonsense knowledge. However, this research requires expensive labor costs to annotate data as the basis of research, and models that rely on fine-tuning paradigms only apply to specific tasks, rather than learn a general commonsense reasoning ability. As a more robust method, zero-shot commonsense question answering shows a good prospect. The current zero-shot framework tries to convert triples in commonsense knowledge graphs (KGs) into QA-form samples as the pre-trained data source to incorporate commonsense knowledge into the model. However, this method ignores the multi-hop relationship in the KG, which is also an important central problem in commonsense reasoning. In this paper, we propose a novel multi-hop commonsense knowledge injection framework. Specifically, it explores multi-hop reasoning paradigm in KGs that conform to linguistic logic, and we further propose two multi-hop QA generation methods based on KGs. Then, we utilize contrastive learning to pre-train the model with the synthetic QA dataset to inject multi-hop commonsense knowledge. Extensive experiments on five commonsense question answering benchmarks demonstrate that our framework achieves state-of-art performance.
SDAug 20, 2021
Estimation of Playable Piano Fingering by Pitch-difference Fingering Matching ModelHaoyue Zhao, Xin Guan, Qiang Li
The existing piano fingering labeling statistical models usually consider the constraints among the fingers and the correlation between fingering and notes, and rarely include the relationship among the notes directly. The limited learned finger-transfer rules often cause that some parts of the fingering cannot be playable in fact. And traditional models often adopt the original notes, which cannot help to explore the mapping nature between the pitches and fingering. Inspired from manual-ly annotation which acquire the fingering knowledge directly from pitch-difference, we proposed a pitch-difference sequence and fingering (PdF) matching model. And to get playable fingering, be-sides learned finger-transfer rules, prior finger-transfer knowledge is especially combined into the model. In order to characterize the playable performance of the model, we also presented a new evaluation index named incapable-performing fingering rate (IFR). Comprehensive experimental re-sults show that compared with the existing state-of-the-art third-order hidden Markov labeling model, the general and the highest matching rate of our model increases by 3% and 1.6% respective-ly, and the fingering for all scores can be playable.