CV | AI | Oct 22, 2025

Evaluating ChatGPT's Performance in Classifying Pneumonia from Chest X-Ray Images

arXiv:2510.21839v1
Originality: Synthesis-oriented
AI Analysis

This paper addresses the potential of AI models such as ChatGPT in medical diagnostics, but its contribution is incremental: it documents current limitations rather than reporting a breakthrough.

The study evaluated ChatGPT's zero-shot ability to classify pneumonia from chest X-ray images, finding that concise prompts achieved a peak accuracy of 74%, but performance was inconsistent across prompt designs.

In this study, we evaluate the ability of OpenAI's gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.
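The evaluation described in the abstract (one accuracy figure per prompt design on a labelled test set) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `evaluate` function, the `classify` callable (standing in for a model call with one fixed prompt), and the `INVALID` bucket for off-label replies are all assumptions introduced here. On a 400-image set, an accuracy of 74% would correspond to roughly 296 correct predictions.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# The two class names used in the abstract.
LABELS = ("NORMAL", "PNEUMONIA")


def evaluate(
    samples: Iterable[Tuple[str, str]],
    classify: Callable[[str], str],
) -> Tuple[float, Counter]:
    """Score one prompt design.

    samples  -- (image_id, true_label) pairs; true_label is in LABELS.
    classify -- hypothetical callable that returns the model's reply
                for an image under one fixed prompt.
    Returns (accuracy, confusion), where confusion counts
    (true_label, predicted_label) pairs.
    """
    confusion: Counter = Counter()
    total = 0
    for image_id, truth in samples:
        pred = classify(image_id).strip().upper()
        # Assumption: any reply that is not exactly one of the two
        # labels is counted as a miss rather than being re-parsed.
        if pred not in LABELS:
            pred = "INVALID"
        confusion[(truth, pred)] += 1
        total += 1
    correct = sum(confusion[(label, label)] for label in LABELS)
    accuracy = correct / total if total else 0.0
    return accuracy, confusion
```

Running `evaluate` once per prompt design, with the same samples each time, would give comparable accuracy figures; the confusion counts additionally show whether errors lean toward false positives or false negatives, which a single accuracy number hides.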
