6.0 | AI | Mar 31
AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding
Moiz Sadiq Awan, Maryam Raza
Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.
3.3 | HC | Mar 26
Beyond Benchmarks: How Users Evaluate AI Chat Assistants
Moiz Sadiq Awan, Muhammad Haris Noor, Muhammad Salman Munaf
Automated benchmarks dominate the evaluation of large language models, yet no systematic study has compared user satisfaction, adoption motivations, and frustrations across competing platforms using a consistent instrument. We address this gap with a cross-platform survey of 388 active AI chat users, comparing satisfaction, adoption drivers, use case performance, and qualitative frustrations across seven major platforms: ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, and Llama. Three broad findings emerge. First, the top three platforms (Claude, ChatGPT, and DeepSeek) receive statistically indistinguishable satisfaction ratings despite vast differences in funding, team size, and benchmark performance. Second, users treat these tools as interchangeable utilities rather than sticky ecosystems: over 80% use two or more platforms, and switching costs are negligible. Third, each platform attracts users for different reasons: ChatGPT for its interface, Claude for answer quality, DeepSeek through word-of-mouth, and Grok for its content policy, suggesting that specialization, not generalist dominance, sustains competition. Hallucination and content filtering remain the most common frustrations across all platforms. These findings offer an early empirical baseline for a market that benchmarks alone cannot characterize, and point toward competitive plurality rather than winner-take-all consolidation among engaged users.