CY · AI · Apr 20

First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

arXiv:2604.18038 · 44.0 · h-index: 2
Predicted impact: top 48% in CY · last 90 days · Originality: Incremental advance
AI Analysis

For healthcare AI developers and regulators, this work provides a multi-metric bias evaluation framework and shows that agentic workflows can mitigate some explicit bias, though improvements are not uniform.

This study evaluates racial bias in five LLMs on two tasks, synthetic patient-case generation and differential diagnosis ranking, and finds that all models deviate from observed U.S. racial distributions when generating cases. An agentic workflow reduced explicit bias in DeepSeek V3, improving mean p-value by 0.0348 and median p-value by 0.1166 relative to the standalone model.

Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.
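To make the case-generation comparison concrete, the sketch below measures how far the racial mix of a batch of generated cases sits from a reference distribution. It is a minimal illustration, not the paper's pipeline: the function name `distribution_deviation`, the placeholder group labels, and all numbers are hypothetical, and the abstract does not say which statistics or tests produce the reported p-values or mean difference. The mean absolute difference and the chi-square goodness-of-fit statistic are used here as assumed stand-ins.

```python
from collections import Counter


def distribution_deviation(generated_races, reference_props):
    """Compare the racial mix of generated cases with a reference distribution.

    Returns per-group signed differences (generated share minus reference
    share), the mean absolute difference across groups, and the chi-square
    goodness-of-fit statistic. Labels absent from reference_props are ignored.
    """
    n = len(generated_races)
    if n == 0:
        raise ValueError("need at least one generated case")
    if any(share <= 0 for share in reference_props.values()):
        raise ValueError("reference shares must be positive")

    counts = Counter(generated_races)
    diffs = {}
    chi_square = 0.0
    for group, expected_share in reference_props.items():
        observed = counts.get(group, 0)
        expected = expected_share * n
        diffs[group] = observed / n - expected_share
        chi_square += (observed - expected) ** 2 / expected

    mean_abs_diff = sum(abs(d) for d in diffs.values()) / len(diffs)
    return diffs, mean_abs_diff, chi_square


# Hypothetical reference shares and model output, for illustration only.
reference = {"A": 0.5, "B": 0.3, "C": 0.2}
generated = ["A"] * 70 + ["B"] * 20 + ["C"] * 10

diffs, mean_abs_diff, chi_square = distribution_deviation(generated, reference)
print(diffs, mean_abs_diff, chi_square)
```

A larger mean absolute difference or chi-square statistic indicates that the generated cohort drifts further from the reference. Reporting both per-group and aggregate values is in the spirit of the multi-metric evaluation the abstract argues for, since an aggregate can hide which groups are over- or under-represented.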

