AI May 28
Small Agent Group is the Future of Digital Health
Yuqiao Meng, Luoxi Tang, Dazheng Zhang et al.
The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption that clinical intelligence increases with model size and data. However, real-world clinical needs include not only effectiveness, but also reliability and reasonable deployment cost. Since clinical decision-making is inherently collaborative, we challenge the monolithic scaling paradigm and ask whether a Small Agent Group (SAG) can support better clinical reasoning. SAG shifts from single-model intelligence to collective expertise by distributing reasoning, evidence-based analysis, and critical audit through a collaborative deliberation process. To assess the clinical utility of SAG, we conduct extensive evaluations using diverse clinical metrics spanning effectiveness, reliability, and deployment cost. Our results show that SAG achieves superior performance compared to a single giant model, both with and without additional optimization or retrieval-augmented generation. These findings suggest that the synergistic reasoning represented by SAG can substitute for model parameter growth in clinical settings. Overall, SAG offers a scalable solution to digital health that better balances effectiveness, reliability, and deployment efficiency.
HC May 22
Improving Clinical Data Accessibility Through Automated FHIR Data Transformation Tools
Adarsh Pawar, Yuqiao Meng, Luoxi Tang et al.
The Fast Healthcare Interoperability Resources (FHIR) standard has emerged as a widely adopted specification for exchanging structured clinical data across healthcare systems. However, raw FHIR resources are often complex, verbose, and difficult for clinicians and analysts to interpret without specialized tooling. This paper presents a lightweight, browser-based system that improves the accessibility of FHIR data by automatically transforming raw JSON resources into human-readable PDF and Excel reports, along with interactive data visualizations. The system supports both remote retrieval of FHIR resources from server endpoints and the upload of local FHIR JSON files, enabling both online and offline analysis. Using a modular React architecture with jsPDF, xlsx, and Recharts, the tool parses, normalizes, visualizes, and exports FHIR data in an intuitive format. Evaluation results demonstrate that the system enhances interpretability and usability while preserving the semantic integrity of FHIR structures. Limitations and future extensions, including expanded FHIR profile support and clinical validation, are discussed.
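The parse-and-normalize step described in this abstract can be illustrated with a minimal Python sketch (the paper's tool is built in React, so this is not its implementation). It assumes a FHIR Bundle whose Observation resources carry `code`, `valueQuantity`, and `effectiveDateTime`; the function name `flatten_observations` and the sample bundle are hypothetical.

```python
def flatten_observations(bundle: dict) -> list[dict]:
    """Flatten Observation resources in a FHIR Bundle into report-ready rows."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue  # skip Patient, Encounter, and other resource types
        code = resource.get("code", {})
        coding = (code.get("coding") or [{}])[0]
        quantity = resource.get("valueQuantity", {})
        rows.append({
            "id": resource.get("id"),
            # Prefer the free-text label, fall back to the first coding's display.
            "name": code.get("text") or coding.get("display"),
            "value": quantity.get("value"),
            "unit": quantity.get("unit"),
            "date": resource.get("effectiveDateTime"),
        })
    return rows


# Hypothetical sample bundle used only for illustration.
SAMPLE_BUNDLE = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "pat-1"}},
        {"resource": {
            "resourceType": "Observation",
            "id": "obs-1",
            "code": {"coding": [{"system": "http://loinc.org",
                                 "code": "8867-4",
                                 "display": "Heart rate"}]},
            "valueQuantity": {"value": 72, "unit": "beats/minute"},
            "effectiveDateTime": "2024-01-01T10:00:00Z",
        }},
    ],
}

rows = flatten_observations(SAMPLE_BUNDLE)
```

Rows of this shape could then be handed to a spreadsheet or PDF writer; missing fields come through as `None` rather than raising, which suits incomplete real-world resources.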
CR May 14
SafeGPT: Preventing Data Leakage and Unethical Outputs in Enterprise LLM Use
Pratyush Desai, Luoxi Tang, Yuqiao Meng et al.
Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertently share confidential data or generate policy-violating content. This paper proposes SafeGPT, a two-sided guardrail system that prevents sensitive data leakage and unethical outputs. SafeGPT integrates input-side detection and redaction, output-side moderation and reframing, and human-in-the-loop feedback. Experiments demonstrate that SafeGPT effectively reduces data leakage risk and biased outputs while maintaining user satisfaction.
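As a rough illustration of input-side detection and redaction, the sketch below masks two kinds of sensitive strings with regular expressions before a prompt would leave the enterprise boundary. The patterns, labels, and the `redact` function are assumptions made for this example; the abstract does not specify how SafeGPT detects sensitive data.

```python
import re

# Hypothetical detectors; a real deployment would need far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(prompt: str) -> tuple[str, list[str]]:
    """Replace each detected span with a [LABEL] tag and report what was found."""
    found = []
    for label, pattern in PATTERNS.items():
        if pattern.search(prompt):
            found.append(label)
            prompt = pattern.sub(f"[{label}]", prompt)
    return prompt, found


redacted, labels = redact("Email jane.doe@example.com about SSN 123-45-6789.")
print(redacted)
print(labels)
```

Returning the list of triggered labels alongside the redacted text leaves room for the human-in-the-loop feedback the abstract mentions, for example by logging which detectors fired.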
LG May 10
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
Luoxi Tang, Rupali Rajendra Vaje, Yuqiao Meng et al.
Agentic memory enables LLMs to persist information beyond a single context window and reuse it in later decisions, but it also introduces a new vulnerability: spurious correlations, where retrieved memory carries miscorrelated evidence and propagates erroneous reasoning into downstream decisions. Despite the widespread use of agentic memory, this risk remains largely underexplored. We address it from two aspects. First, we benchmark several canonical types of spurious patterns identified through causal structure and record them across trajectory-level memory. Diagnosing agentic memory systems on this benchmark reveals that memory improves reasoning on clean inputs but amplifies reliance on spurious patterns when they are present. Second, we propose CAMEL, a plug-and-play calibration method that operates across diverse memory architectures at both write and retrieval time. CAMEL consistently reduces reliance on spurious patterns across all three types while preserving or improving performance on clean inputs and staying robust under adaptive attacks targeting the calibration. Overall, CAMEL offers a principled and lightweight solution toward more reliable agentic memory deployment.
AI May 10
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
Yuqiao Meng, Sakshi Sunil Narvekar, Luoxi Tang et al.
Multi-agent debate (MAD) systems increasingly rely on shared memory to support long-horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory-augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM-based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross-agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero-trust memory game, in which no agent is assumed honest and the game's equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference-time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents' existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding- and graph-based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.
MA Feb 6
The Value of Variance: Mitigating Debate Collapse in Multi-Agent Systems via Uncertainty-Driven Policy Optimization
Luoxi Tang, Yuqiao Meng, Joseph Costa et al.
Multi-agent debate (MAD) systems improve LLM reasoning through iterative deliberation, but remain vulnerable to debate collapse, a failure mode in which the agents' final decisions are compromised by erroneous reasoning. Existing methods lack principled mechanisms to detect or prevent such failures. To address this gap, we first propose a hierarchical metric that quantifies behavioral uncertainty at three levels: intra-agent (individual reasoning uncertainty), inter-agent (interactive uncertainty), and system-level (output uncertainty). Empirical analysis across several benchmarks reveals that the proposed uncertainty measures reliably indicate system failures, which supports their use as diagnostic metrics. We then propose a mitigation strategy: an uncertainty-driven policy optimization that penalizes self-contradiction, peer conflict, and low-confidence outputs in a dynamic debating environment. Experiments demonstrate that this mitigation reliably calibrates the multi-agent system, consistently improving decision accuracy while reducing system disagreement.
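To make the inter-agent and system-level notions concrete, here is a minimal sketch using two generic stand-ins: pairwise disagreement between agents' final answers, and Shannon entropy of the answer distribution. These are illustrative proxies chosen for this example, not the hierarchical metric defined in the paper, and both function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the final-answer distribution across agents."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pairwise_disagreement(answers: list[str]) -> float:
    """Fraction of agent pairs whose final answers differ."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


unanimous = ["A", "A", "A", "A"]
split = ["A", "A", "B", "B"]
print(answer_entropy(unanimous), pairwise_disagreement(unanimous))
print(answer_entropy(split), pairwise_disagreement(split))
```

Note that a unanimous debate scores zero on both proxies even when the shared answer is wrong, which is one reason the abstract also includes an intra-agent level.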
SE Jan 8
RiskBridge: Turning CVEs into Business-Aligned Patch Priorities
Yelena Mujibur Sheikh, Awez Akhtar Khatik, Luoxi Tang et al.
Enterprises are confronted with an unprecedented escalation in cybersecurity vulnerabilities, with thousands of new CVEs disclosed each month. Conventional prioritization frameworks such as CVSS offer static severity metrics that fail to account for exploit probability, compliance urgency, and operational impact, resulting in inefficient and delayed remediation. This paper introduces RiskBridge, an explainable and compliance-aware vulnerability management framework that integrates multi-source intelligence from CVSS v4, EPSS, and CISA KEV to produce dynamic, business-aligned patch priorities. RiskBridge employs a probabilistic Zero-Day Exposure Simulation (ZDES) model to forecast near-term exploit likelihood, a Policy-as-Code Engine to translate regulatory mandates (e.g., PCI DSS, NIST SP 800-53) into automated SLA logic, and an ROI-driven Optimizer to maximize cumulative risk reduction per remediation effort. Experimental evaluations using live CVE datasets demonstrate an 88% reduction in residual risk, an 18-day improvement in SLA compliance, and a 35% increase in remediation efficiency compared to state-of-the-art commercial baselines. These findings validate RiskBridge as a practical and auditable decision-intelligence system that unifies probabilistic modeling, compliance reasoning, and optimization analytics. The framework represents a step toward automated, explainable, and business-centric vulnerability management in modern enterprise environments.
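The idea of combining CVSS, EPSS, and KEV signals into a single ranking can be sketched as follows. The scoring formula, the `asset_criticality` field, and the placeholder vulnerability IDs are all assumptions made for illustration; this is not the ZDES model or the ROI-driven Optimizer described in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Vulnerability:
    vuln_id: str              # placeholder identifier, not a real CVE
    cvss: float               # base severity, 0.0 to 10.0
    epss: float               # estimated exploit probability, 0.0 to 1.0
    in_kev: bool              # listed in a known-exploited catalog
    asset_criticality: float  # hypothetical business weight, 0.0 to 1.0


def priority(v: Vulnerability) -> float:
    """Toy score: severity x exploit likelihood x business weight."""
    # Known exploitation is treated as certainty of exploit (an assumption).
    likelihood = 1.0 if v.in_kev else v.epss
    return (v.cvss / 10.0) * likelihood * v.asset_criticality


backlog = [
    Vulnerability("A", cvss=9.8, epss=0.02, in_kev=False, asset_criticality=1.0),
    Vulnerability("B", cvss=7.5, epss=0.10, in_kev=True, asset_criticality=1.0),
    Vulnerability("C", cvss=6.0, epss=0.60, in_kev=False, asset_criticality=0.5),
]

ranked = sorted(backlog, key=priority, reverse=True)
order = [v.vuln_id for v in ranked]
print(order)
```

In this toy backlog the highest-CVSS item ranks last because its exploit probability is low, which is the kind of reordering relative to static severity that the abstract motivates.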
CR Jan 9
Smart Privacy Policy Assistant: An LLM-Powered System for Transparent and Actionable Privacy Notices
Sriharshini Kalvakuntla, Luoxi Tang, Yuqiao Meng et al.
Most users agree to online privacy policies without reading or understanding them, even though these documents govern how personal data is collected, shared, and monetized. Privacy policies are typically long, legally complex, and difficult for non-experts to interpret. This paper presents the Smart Privacy Policy Assistant, an LLM-powered system that automatically ingests privacy policies, extracts and categorizes key clauses, assigns human-interpretable risk levels, and generates clear, concise explanations. The system is designed for real-time use through browser extensions or mobile interfaces, surfacing contextual warnings before users disclose sensitive information or grant risky permissions. We describe the end-to-end pipeline, including policy ingestion, clause categorization, risk scoring, and explanation generation, and propose an evaluation framework based on clause-level accuracy, policy-level risk agreement, and user comprehension.
CL Jan 9
Semantic NLP Pipelines for Interoperable Patient Digital Twins from Unstructured EHRs
Rafael Brens, Yuqiao Meng, Luoxi Tang et al.
Digital twins, virtual replicas of physical entities, are gaining traction in healthcare for personalized monitoring, predictive modeling, and clinical decision support. However, generating interoperable patient digital twins from unstructured electronic health records (EHRs) remains challenging due to variability in clinical documentation and lack of standardized mappings. This paper presents a semantic NLP-driven pipeline that transforms free-text EHR notes into FHIR-compliant digital twin representations. The pipeline leverages named entity recognition (NER) to extract clinical concepts, concept normalization to map entities to SNOMED-CT or ICD-10, and relation extraction to capture structured associations between conditions, medications, and observations. Evaluation on MIMIC-IV Clinical Database Demo with validation against MIMIC-IV-on-FHIR reference mappings demonstrates high F1-scores for entity and relation extraction, with improved schema completeness and interoperability compared to baseline methods.
NI Jan 9
Adversarial Network Imagination: Causal LLMs and Digital Twins for Proactive Telecom Mitigation
Vignesh Sriram, Yuqiao Meng, Luoxi Tang et al.
Telecommunication networks experience complex failures such as fiber cuts, traffic overloads, and cascading outages. Existing monitoring and digital twin systems are largely reactive, detecting failures only after service degradation occurs. We propose Adversarial Network Imagination, a closed-loop framework that integrates a Causal Large Language Model (LLM), a Knowledge Graph, and a Digital Twin to proactively generate, simulate, and evaluate adversarial network failures. The Causal LLM produces structured failure scenarios grounded in network dependencies encoded in the Knowledge Graph. These scenarios are executed within a Digital Twin to measure performance degradation and evaluate mitigation strategies. By iteratively refining scenarios based on simulation feedback, the framework shifts network operations from reactive troubleshooting toward anticipatory resilience analysis.
AI May 17, 2025
On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study
Shuai Yang, Qi Yang, Luoxi Tang et al.
Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.
CR Oct 2, 2025
POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment
Luoxi Tang, Yuqiao Meng, Ankita Patra et al.
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats, wherein LLMs offer cyber threat intelligence (CTI) to support vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than the model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspections, we reveal three fundamental vulnerabilities: spurious correlations, contradictory knowledge, and constrained generalization, that limit LLMs in effectively supporting CTI. Subsequently, we provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
CR Sep 28, 2025
Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence
Yuqiao Meng, Luoxi Tang, Feiyang Yu et al.
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats, wherein LLMs offer cyber threat intelligence (CTI) to support vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than the model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspections, we reveal three fundamental vulnerabilities: spurious correlations, contradictory knowledge, and constrained generalization, that limit LLMs in effectively supporting CTI. Subsequently, we provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
CR Sep 28, 2025
Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting
Yuqiao Meng, Luoxi Tang, Feiyang Yu et al.
As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.
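Ordering analytical tasks by their dependencies, as this abstract describes, can be sketched with a topological sort. The task names and the dependency graph below are hypothetical examples (only threat attribution and incident response are named in the abstract); the benchmark's actual 30 tasks and 9 modules are not reproduced here.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the tasks it depends on.
TASK_DEPENDENCIES = {
    "threat_attribution": set(),
    "behavior_analysis": {"threat_attribution"},
    "impact_assessment": {"behavior_analysis"},
    "incident_response": {"behavior_analysis", "impact_assessment"},
}


def plan_workflow(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order that respects every dependency."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(dependencies).static_order())


workflow = plan_workflow(TASK_DEPENDENCIES)
print(workflow)
```

Each task in the resulting order could then be dispatched to its own operational module, so that a model only works on a step once the steps it depends on have produced output.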
CL May 17, 2025
Are LLMs Ready for English Standardized Tests? A Benchmarking and Elicitation Perspective
Luoxi Tang, Tharunya Sundar, Shuai Yang et al.
AI is transforming education by enabling powerful tools that enhance learning experiences. Among recent advancements, large language models (LLMs) hold particular promise for revolutionizing how learners interact with educational content. In this work, we investigate the potential of LLMs to support standardized test preparation by focusing on English Standardized Tests (ESTs). Specifically, we assess their ability to generate accurate and contextually appropriate solutions across a diverse set of EST question types. We introduce ESTBOOK, a comprehensive benchmark designed to evaluate the capabilities of LLMs in solving EST questions. ESTBOOK aggregates five widely recognized tests, encompassing 29 question types and over 10,576 questions across multiple modalities, including text, images, audio, tables, and mathematical symbols. Using ESTBOOK, we systematically evaluate both the accuracy and inference efficiency of LLMs. Additionally, we propose a breakdown analysis framework that decomposes complex EST questions into task-specific solution steps. This framework allows us to isolate and assess LLM performance at each stage of the reasoning process. Evaluation findings offer insights into the capability of LLMs in educational contexts and point toward targeted strategies for improving their reliability as intelligent tutoring systems.