AINov 28, 2023Code
War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World WarsWenyue Hua, Lizhou Fan, Lingyao Li et al.
Can we avoid wars at the crossroads of history? This question has been pursued by individuals, scholars, policymakers, and organizations throughout human history. In this research, we attempt to answer the question based on the recent advances of Artificial Intelligence (AI) and Large Language Models (LLMs). We propose \textbf{WarAgent}, an LLM-powered multi-agent AI system, to simulate the participating countries, their decisions, and the consequences, in historical international conflicts, including the World War I (WWI), the World War II (WWII), and the Warring States Period (WSP) in Ancient China. By evaluating the simulation effectiveness, we examine the advancements and limitations of cutting-edge AI systems' abilities in studying complex collective human behaviors such as international conflicts under diverse settings. In these simulations, the emergent interactions among agents also offer a novel perspective for examining the triggers and conditions that lead to war. Our findings offer data-driven and AI-augmented insights that can redefine how we approach conflict resolution and peacekeeping strategies. The implications stretch beyond historical analysis, offering a blueprint for using AI to understand human history and possibly prevent future international conflicts. Code and data are available at \url{https://github.com/agiresearch/WarAgent}.
CLSep 27, 2024
Simulated patient systems powered by large language model-based AI agents offer potential for transforming medical educationHuizi Yu, Jiayan Zhou, Lingyao Li et al. · harvard
Background: Simulated patient systems are important in medical education and research, providing safe, integrative training environments and supporting clinical decision making. Advances in artificial intelligence (AI), especially large language models (LLMs), can enhance simulated patients by replicating medical conditions and doctor patient interactions with high fidelity and at low cost, but effectiveness and trustworthiness remain open challenges. Methods: We developed AIPatient, a simulated patient system powered by LLM based AI agents. The system uses a retrieval augmented generation (RAG) framework with six task specific agents for complex reasoning. To improve realism, it is linked to the AIPatient knowledge graph built from de identified real patient data in the MIMIC III intensive care database. Results: We evaluated electronic health record (EHR) based medical question answering (QA), readability, robustness, stability, and user experience. AIPatient reached 94.15 percent QA accuracy when all six agents were enabled, outperforming versions with partial or no agent integration. The knowledge base achieved an F1 score of 0.89. Readability scores showed a median Flesch Reading Ease of 68.77 and a median Flesch Kincaid Grade of 6.4, indicating accessibility for most medical trainees and clinicians. Robustness and stability were supported by non significant variance in repeated trials (analysis of variance F value 0.61, p greater than 0.1; F value 0.78, p greater than 0.1). A user study with medical students showed that AIPatient provides high fidelity, usability, and educational value, comparable to or better than human simulated patients for history taking. Conclusions: LLM based simulated patient systems can deliver accurate, readable, and reliable medical encounters and show strong potential to transform medical education.
DLApr 3, 2023
A Bibliometric Review of Large Language Models Research from 2017 to 2023Lizhou Fan, Lingyao Li, Zihui Ma et al.
Large language models (LLMs) are a class of language models that have demonstrated outstanding performance across a range of natural language processing (NLP) tasks and have become a highly sought-after research area, because of their ability to generate human-like language and their potential to revolutionize science and technology. In this study, we conduct bibliometric and discourse analyses of scholarly literature on LLMs. Synthesizing over 5,000 publications, this paper serves as a roadmap for researchers, practitioners, and policymakers to navigate the current landscape of LLMs research. We present the research trends from 2017 to early 2023, identifying patterns in research paradigms and collaborations. We start with analyzing the core algorithm developments and NLP tasks that are fundamental in LLMs research. We then investigate the applications of LLMs in various fields and domains including medicine, engineering, social science, and humanities. Our review also reveals the dynamic, fast-paced evolution of LLMs research. Overall, this paper offers valuable insights into the current state, impact, and potential of LLMs research and its applications.
CLApr 20, 2023
"HOT" ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social mediaLingyao Li, Lizhou Fan, Shubham Atreja et al.
Harmful content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to address this issue is to develop detection models that rely on human annotations. However, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. Generative AI models have the potential to understand and detect harmful content. To investigate this potential, we used ChatGPT and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful content: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model displays a more consistent classification for non-HOT comments than HOT comments compared to human annotations. Our findings also suggest that ChatGPT classifications align with provided HOT definitions, but ChatGPT classifies "hateful" and "offensive" as subsets of "toxic." Moreover, the choice of prompts used to interact with ChatGPT impacts its performance. Based on these in-sights, our study provides several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understand-ing and reasoning of the HOT concept, and the impact of prompts on its performance. Overall, our study provides guidance about the potential of using generative AI models to moderate large volumes of user-generated content on social media.
CLSep 30, 2024
Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for MisinformationShan Chen, Mingye Gao, Kuleen Sasse et al. · harvard
Background: Large language models (LLMs) are trained to follow directions, but this introduces a vulnerability to blindly comply with user requests even if they generate wrong information. In medicine, this could accelerate the generation of misinformation that impacts human well-being. Objectives/Methods: We analyzed compliance to requests to generate misleading content about medications in settings where models know the request is illogical. We investigated whether in-context directions and instruction-tuning of LLMs to prioritize logical reasoning over compliance reduced misinformation risk. Results: While all frontier LLMs complied with misinformation requests, both prompt-based and parameter-based approaches can improve the detection of logic flaws in requests and prevent the dissemination of medical misinformation. Conclusion: Shifting LLMs to prioritize logic over compliance could reduce risks of exploitation for medical misinformation.
DLMar 10, 2022
Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic LiteratureLizhou Fan, Sara Lafia, David Bleckley et al.
Data citations provide a foundation for studying research data impact. Collecting and managing data citations is a new frontier in archival science and scholarly communication. However, the discovery and curation of research data citations is labor intensive. Data citations that reference unique identifiers (i.e. DOIs) are readily findable; however, informal mentions made to research data are more challenging to infer. We propose a natural language processing (NLP) paradigm to support the human task of identifying informal mentions made to research datasets. The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research (ICPSR), a large social science data archive that maintains a large bibliography of data-related literature. The NLP model is bootstrapped from data citations actively collected by librarians at ICPSR. The model combines pattern matching with multiple iterations of human annotations to learn additional rules for detecting informal data mentions. These examples are then used to train an NLP pipeline. The librarian-in-the-loop paradigm is centered in the data work performed by ICPSR librarians, supporting broader efforts to build a more comprehensive bibliography of data-related literature that reflects the scholarly communities of research data users.
DLMay 23, 2022
A Natural Language Processing Pipeline for Detecting Informal Data References in Academic LiteratureSara Lafia, Lizhou Fan, Libby Hemphill
Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
90.3CYMay 24
LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human AlignmentLingyao Li, Deyi Li, Chen Chen et al.
Large language models (LLMs) are increasingly deployed across healthcare applications, including clinical documentation, diagnostic reasoning, medicine recommendation, and medical education. Their outputs are largely unstructured clinical text, which is difficult to reliably evaluate at scale. LLM-as-a-Judge, in which an LLM evaluates another system's output against task-specific criteria, offers a scalable alternative and is increasingly used in clinical evaluation, yet its validity in healthcare remains underexamined. Existing reviews focus on general-purpose LLM evaluation or on risk framework, rather than systematically characterizing how LLM-as-a-Judge is applied in healthcare and how well their judgments align with human experts. We therefore conduct a PRISMA-guided comprehensive review of LLM-as-a-Judge applications in healthcare, searching five databases for studies published between January 2023 and February 2026. After screening 541 records, 134 studies meet the eligibility and are coded by health scenario, judge configuration, technical approach, and validation design. LLM-as-a-Judge is concentrated in clinical decision support, clinical natural language processing (NLP), medical knowledge and question answering (QA), and medical communication. OpenAI models are the most frequently used judges, and prompt engineering appears in nearly all studies, with ensemble, multi-agent, and retrieval-augmented designs as common extensions. Among studies reporting human validation, LLM judges often show moderate to strong alignment with expert judgments, although reliability varies substantially across tasks. Overall, this review positions LLM-as-a-Judge as a promising framework for scalable healthcare AI evaluation, while emphasizing that its clinical value depends on model design and rigorous validation.
AIDec 22, 2023Code
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity ClassesLizhou Fan, Wenyue Hua, Lingyao Li et al.
Complex reasoning ability is one of the most important features of current LLMs, which has also been leveraged to play an integral role in complex decision-making tasks. Therefore, the investigation into the reasoning capabilities of Large Language Models (LLMs) is critical: numerous benchmarks have been established to assess the reasoning abilities of LLMs. However, current benchmarks are inadequate in offering a rigorous evaluation of the full extent of reasoning abilities that LLMs are capable of achieving. They are also prone to the risk of overfitting, as these benchmarks, being publicly accessible and static, allow models to potentially tailor their responses to specific benchmark metrics, thereby inflating their performance. Addressing these limitations, our research introduces a new benchmark, named NPHardEval. This benchmark is designed to evaluate the reasoning abilities of LLMs across a broad spectrum of 900 algorithmic questions, extending up to the NP-Hard complexity class. These questions are meticulously chosen to represent a wide range of complexity class below the NP-hard complexity class, offering a rigorous measure of the reasoning ability of LLMs. Through this study, we shed light on the current state of reasoning in LLMs, providing an objective and rigorous perspective through the comparison of LLMs' performance across complex classes. Moreover, this benchmark is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis. Such regular updates play a crucial role in mitigating the risk of LLMs overfitting to the benchmark, promoting a more accurate and reliable assessment of their reasoning capabilities. The benchmark dataset and code of NPHardEval are available at https://github.com/casmlab/NPHardEval.
SIAug 21, 2024
Characterizing Online Toxicity During the 2022 Mpox Outbreak: A Computational Analysis of Topical and Network DynamicsLizhou Fan, Lingyao Li, Libby Hemphill
Background: Online toxicity, encompassing behaviors such as harassment, bullying, hate speech, and the dissemination of misinformation, has become a pressing social concern in the digital age. The 2022 Mpox outbreak, initially termed "Monkeypox" but subsequently renamed to mitigate associated stigmas and societal concerns, serves as a poignant backdrop to this issue. Objective: In this research, we undertake a comprehensive analysis of the toxic online discourse surrounding the 2022 Mpox outbreak. Our objective is to dissect its origins, characterize its nature and content, trace its dissemination patterns, and assess its broader societal implications, with the goal of providing insights that can inform strategies to mitigate such toxicity in future crises. Methods: We collected more than 1.6 million unique tweets and analyzed them from five dimensions, including context, extent, content, speaker, and intent. Utilizing BERT-based topic modeling and social network community clustering, we delineated the toxic dynamics on Twitter. Results: We identified five high-level topic categories in the toxic online discourse on Twitter, including disease (46.6%), health policy and healthcare (19.3%), homophobia (23.9%), politics (6.0%), and racism (4.1%). Through the toxicity diffusion networks of mentions, retweets, and the top users, we found that retweets of toxic content were widespread, while influential users rarely engaged with or countered this toxicity through retweets. Conclusions: By tracking topical dynamics, we can track the changing popularity of toxic content online, providing a better understanding of societal challenges. Network dynamics spotlight key social media influencers and their intents, indicating that addressing these central figures in toxic discourse can enhance crisis communication and inform policy-making.
CLFeb 6
Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoningXinxin Lin, Guangxin Dai, Yi Zhong et al.
Large language models (LLMs) hold transformative potential for medical decision support yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on a unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the ClinMPO-aligned light LLM performance against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
AINov 8, 2024Code
Game-theoretic LLM: Agent Workflow for Negotiation GamesWenyue Hua, Ollie Liu, Lingyao Li et al.
This paper investigates the rationality of large language models (LLMs) in strategic decision-making contexts, specifically within the framework of game theory. We evaluate several state-of-the-art LLMs across a spectrum of complete-information and incomplete-information games. Our findings reveal that LLMs frequently deviate from rational strategies, particularly as the complexity of the game increases with larger payoff matrices or deeper sequential trees. To address these limitations, we design multiple game-theoretic workflows that guide the reasoning and decision-making processes of LLMs. These workflows aim to enhance the models' ability to compute Nash Equilibria and make rational choices, even under conditions of uncertainty and incomplete information. Experimental results demonstrate that the adoption of these workflows significantly improves the rationality and robustness of LLMs in game-theoretic tasks. Specifically, with the workflow, LLMs exhibit marked improvements in identifying optimal strategies, achieving near-optimal allocations in negotiation scenarios, and reducing susceptibility to exploitation during negotiations. Furthermore, we explore the meta-strategic considerations of whether it is rational for agents to adopt such workflows, recognizing that the decision to use or forgo the workflow constitutes a game-theoretic issue in itself. Our research contributes to a deeper understanding of LLMs' decision-making capabilities in strategic contexts and provides insights into enhancing their rationality through structured workflows. The findings have implications for the development of more robust and strategically sound AI agents capable of navigating complex interactive environments. Code and data supporting this study are available at \url{https://github.com/Wenyueh/game_theory}.
CVAug 29, 2024
Enhancing Autism Spectrum Disorder Early Detection with the Parent-Child Dyads Block-Play Protocol and an Attention-enhanced GCN-xLSTM Hybrid Deep Learning FrameworkXiang Li, Lizhou Fan, Hanbo Wu et al.
Autism Spectrum Disorder (ASD) is a rapidly growing neurodevelopmental disorder. Performing a timely intervention is crucial for the growth of young children with ASD, but traditional clinical screening methods lack objectivity. This study introduces an innovative approach to early detection of ASD. The contributions are threefold. First, this work proposes a novel Parent-Child Dyads Block-Play (PCB) protocol, grounded in kinesiological and neuroscientific research, to identify behavioral patterns distinguishing ASD from typically developing (TD) toddlers. Second, we have compiled a substantial video dataset, featuring 40 ASD and 89 TD toddlers engaged in block play with parents. This dataset exceeds previous efforts on both the scale of participants and the length of individual sessions. Third, our approach to action analysis in videos employs a hybrid deep learning framework, integrating a two-stream graph convolution network with attention-enhanced xLSTM (2sGCN-AxLSTM). This framework is adept at capturing dynamic interactions between toddlers and parents by extracting spatial features correlated with upper body and head movements and focusing on global contextual information of action sequences over time. By learning these global features with spatio-temporal correlations, our 2sGCN-AxLSTM effectively analyzes dynamic human behavior patterns and demonstrates an unprecedented accuracy of 89.6\% in early detection of ASD. Our approach shows strong potential for enhancing early ASD diagnosis by accurately analyzing parent-child interactions, providing a critical tool to support timely and informed clinical decision-making.
CLFeb 8, 2024Code
EmojiPrompt: Generative Prompt Obfuscation for Privacy-Preserving Communication with Cloud-based LLMsSam Lin, Wenyue Hua, Zhenting Wang et al.
Cloud-based Large Language Models (LLMs) such as ChatGPT have become increasingly integral to daily operations. Nevertheless, they also introduce privacy concerns: firstly, numerous studies underscore the risks to user privacy posed by jailbreaking cloud-based LLMs; secondly, the LLM service providers have access to all user data, which deters individuals from confidently utilizing such services. To address such concerns, we propose a simple yet effective paradigm, EmojiPrompt, to protect user privacy. At its core, EmojiPrompt performs generative transformation, obfuscating private data within prompts with linguistic and non-linguistic elements before submitting them to cloud-based LLMs. We evaluate EmojiPrompt's performance across 8 datasets from various domains. We also propose simulated inference attacks to assess EmojiPrompt's ability to preserve user privacy. The results demonstrate that EmojiPrompt effectively obfuscates user private data, while largely maintaining, or even enhancing, performances compared to the unobfuscated version. Furthermore, EmojiPrompt's atomic-level obfuscation allows it to function exclusively with cloud-based LLMs. For source code, please refer to: https://github.com/agiresearch/EmojiCrypt.
CLJun 4, 2024Code
Disentangling Logic: The Role of Context in Large Language Model Reasoning CapabilitiesWenyue Hua, Kaijie Zhu, Lingyao Li et al.
This study intends to systematically disentangle pure logic reasoning and text understanding by investigating the contrast across abstract and contextualized logical problems from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions (1) Can abstract logical problems alone accurately benchmark an LLM's reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problem generalize to contextualized logic problems and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning. In particular, we construct instantiated datasets for deductive and abductive reasoning with 4 levels of difficulty, encompassing 12 distinct categories or domains based on the categorization of Wikipedia. Our experiments aim to provide insights into disentangling context in logical reasoning and the true reasoning capabilities of LLMs and their generalization potential. The code and dataset are available at: https://github.com/agiresearch/ContextHub.
CLMar 4, 2024Code
NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of VisionXiang Li, Wenyue Hua, Kaijie Zhu et al.
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal understanding, yet their reasoning abilities remain underexplored. Existing benchmarks tend to focus on perception or text-based comprehension, offering limited insight into how well these models perform on structured, logic-driven tasks that require both visual and linguistic reasoning. To address this gap, we introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems: Knapsack, Set Cover, Traveling Salesperson, and Vertex Cover. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform combinatorial reasoning under visual-linguistic constraints. We evaluate a set of advanced open-source and closed-source vision-language models under a unified prompting and problem representation framework. This enables fair comparison across models and task types, while isolating key variables affecting performance. Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction. No single model demonstrates consistent reasoning capability across all problem types, and common failure patterns reveal fundamental limitations in current architectures. By leveraging the structure and complexity of NP-hard problems, NPHardEval4V provides a scalable, interpretable, and challenging testbed for diagnosing reasoning behaviors in LVLMs. We hope this benchmark can support the community in building more robust, inference-capable multimodal systems. The benchmark dataset and code are available at https://github.com/lizhouf/NPHardEval4.
CLFeb 26, 2025
Medical Hallucinations in Foundation Models and Their Impact on HealthcareYubin Kim, Hyewon Jeong, Shan Chen et al.
Hallucinations in foundation models arise from autoregressive training objectives that prioritize token-likelihood optimization over epistemic accuracy, fostering overconfidence and poorly calibrated uncertainty. We define medical hallucination as any model-generated output that is factually incorrect, logically inconsistent, or unsupported by authoritative clinical evidence in ways that could alter clinical decisions. We evaluated 11 foundation models (7 general-purpose, 4 medical-specialized) across seven medical hallucination tasks spanning medical reasoning and biomedical information retrieval. General-purpose models achieved significantly higher proportions of hallucination-free responses than medical-specialized models (median: 76.6% vs 51.3%, difference = 25.2%, 95% CI: 18.7-31.3%, Mann-Whitney U = 27.0, p = 0.012, rank-biserial r = -0.64). Top-performing models such as Gemini-2.5 Pro exceeded 97% accuracy when augmented with chain-of-thought prompting (base: 87.6%), while medical-specialized models like MedGemma ranged from 28.6-61.9% despite explicit training on medical corpora. Chain-of-thought reasoning significantly reduced hallucinations in 86.4% of tested comparisons after FDR correction (q < 0.05), demonstrating that explicit reasoning traces enable self-verification and error detection. Physician audits confirmed that 64-72% of residual hallucinations stemmed from causal or temporal reasoning failures rather than knowledge gaps. A global survey of clinicians (n = 70) validated real-world impact: 91.8% had encountered medical hallucinations, and 84.7% considered them capable of causing patient harm. The underperformance of medical-specialized models despite domain training indicates that safety emerges from sophisticated reasoning capabilities and broad knowledge integration developed during large-scale pre-training, not from narrow optimization.
DLMar 24, 2024
Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric AnalysisHuizi Yu, Lizhou Fan, Lingyao Li et al.
Large Language Models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), enabling new ways to analyze data, treat patients, and conduct research. This study aims to provide a comprehensive overview of LLM applications in BHI, highlighting their transformative potential and addressing the associated ethical and practical challenges. We reviewed 1,698 research articles from January 2022 to December 2023, categorizing them by research themes and diagnostic categories. Additionally, we conducted network analysis to map scholarly collaborations and research dynamics. Our findings reveal a substantial increase in the potential applications of LLMs to a variety of BHI tasks, including clinical decision support, patient interaction, and medical document analysis. Notably, LLMs are expected to be instrumental in enhancing the accuracy of diagnostic tools and patient care protocols. The network analysis highlights dense and dynamically evolving collaborations across institutions, underscoring the interdisciplinary nature of LLM research in BHI. A significant trend was the application of LLMs in managing specific disease categories such as mental health and neurological disorders, demonstrating their potential to influence personalized medicine and public health strategies. LLMs hold promising potential to further transform biomedical research and healthcare delivery. While promising, the ethical implications and challenges of model validation call for rigorous scrutiny to optimize their benefits in clinical settings. This survey serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
CLFeb 1, 2024
Health-LLM: Personalized Retrieval-Augmented Disease Prediction SystemQinkai Yu, Mingyu Jin, Dong Shu et al.
Recent advancements in artificial intelligence (AI), especially large language models (LLMs), have significantly advanced healthcare applications and demonstrated potentials in intelligent medical treatment. However, there are conspicuous challenges such as vast data volumes and inconsistent symptom characterization standards, preventing full integration of healthcare AI systems with individual patients' needs. To promote professional and personalized healthcare, we propose an innovative framework, Heath-LLM, which combines large-scale feature extraction and medical knowledge trade-off scoring. Compared to traditional health management applications, our system has three main advantages: (1) It integrates health reports and medical knowledge into a large model to ask relevant questions to large language model for disease prediction; (2) It leverages a retrieval augmented generation (RAG) mechanism to enhance feature extraction; (3) It incorporates a semi-automated feature updating framework that can merge and delete features to improve accuracy of disease prediction. We experiment on a large number of health reports to assess the effectiveness of Health-LLM system. The results indicate that the proposed system surpasses the existing ones and has the potential to significantly advance disease prediction and personalized health management.
HCApr 23, 2024
BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical AnalysisShuhang Lin, Wenyue Hua, Lingyao Li et al.
This paper presents BattleAgent, an emulation system that combines the Large Vision-Language Model and Multi-agent System. This novel system aims to simulate complex dynamic interactions among multiple agents, as well as between agents and their environments, over a period of time. It emulates both the decision-making processes of leaders and the viewpoints of ordinary participants, such as soldiers. The emulation showcases the current capabilities of agents, featuring fine-grained multi-modal interactions between agents and landscapes. It develops customizable agent structures to meet specific situational requirements, for example, a variety of battle-related activities like scouting and trench digging. These components collaborate to recreate historical events in a lively and comprehensive manner while offering insights into the thoughts and feelings of individuals from diverse viewpoints. The technological foundations of BattleAgent establish detailed and immersive settings for historical battles, enabling individual agents to partake in, observe, and dynamically respond to evolving battle scenarios. This methodology holds the potential to substantially deepen our understanding of historical events, particularly through individual accounts. Such initiatives can also aid historical research, as conventional historical narratives often lack documentation and prioritize the perspectives of decision-makers, thereby overlooking the experiences of ordinary individuals. BattelAgent illustrates AI's potential to revitalize the human aspect in crucial social events, thereby fostering a more nuanced collective understanding and driving the progressive development of human society.
CLFeb 15, 2025
Toward Equitable Access: Leveraging Crowdsourced Reviews to Investigate Public Perceptions of Health Resource AccessibilityZhaoqian Xue, Guanhong Liu, Kai Wei et al.
Access to health resources is a critical determinant of public well-being and societal resilience, particularly during public health crises when demand for medical services and preventive care surges. However, disparities in accessibility persist across demographic and geographic groups, raising concerns about equity. Traditional survey methods often fall short due to limitations in coverage, cost, and timeliness. This study leverages crowdsourced data from Google Maps reviews, applying advanced natural language processing techniques, specifically ModernBERT, to extract insights on public perceptions of health resource accessibility in the United States during the COVID-19 pandemic. Additionally, we employ Partial Least Squares regression to examine the relationship between accessibility perceptions and key socioeconomic and demographic factors including political affiliation, racial composition, and educational attainment. Our findings reveal that public perceptions of health resource accessibility varied significantly across the U.S., with disparities peaking during the pandemic and slightly easing post-crisis. Political affiliation, racial demographics, and education levels emerged as key factors shaping these perceptions. These findings underscore the need for targeted interventions and policy measures to address inequities, fostering a more inclusive healthcare infrastructure that can better withstand future public health challenges.
CVOct 25, 2025
Cross-Enhanced Multimodal Fusion of Eye-Tracking and Facial Features for Alzheimer's Disease DiagnosisYujie Nie, Jianzhang Ni, Yonglong Ye et al.
Accurate diagnosis of Alzheimer's disease (AD) is essential for enabling timely intervention and slowing disease progression. Multimodal diagnostic approaches offer considerable promise by integrating complementary information across behavioral and perceptual domains. Eye-tracking and facial features, in particular, are important indicators of cognitive function, reflecting attentional distribution and neurocognitive state. However, few studies have explored their joint integration for auxiliary AD diagnosis. In this study, we propose a multimodal cross-enhanced fusion framework that synergistically leverages eye-tracking and facial features for AD detection. The framework incorporates two key modules: (a) a Cross-Enhanced Fusion Attention Module (CEFAM), which models inter-modal interactions through cross-attention and global enhancement, and (b) a Direction-Aware Convolution Module (DACM), which captures fine-grained directional facial features via horizontal-vertical receptive fields. Together, these modules enable adaptive and discriminative multimodal representation learning. To support this work, we constructed a synchronized multimodal dataset, including 25 patients with AD and 25 healthy controls (HC), by recording aligned facial video and eye-tracking sequences during a visual memory-search paradigm, providing an ecologically valid resource for evaluating integration strategies. Extensive experiments on this dataset demonstrate that our framework outperforms traditional late fusion and feature concatenation methods, achieving a classification accuracy of 95.11% in distinguishing AD from HC, highlighting superior robustness and diagnostic performance by explicitly modeling inter-modal dependencies and modality-specific contributions.
CLOct 24, 2025
DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical servicesXiang Li, Huizi Yu, Wenkong Wang et al.
Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for "Guidance Efficacy" and "Dispatch Effectiveness," supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwe's AC1 > 0.70), confirmed the system's high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94 % contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91 % of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7 % neutral sentiment; 90.4 % neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0 % polite; 0 % impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.