EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization
For researchers and practitioners in chart-to-data extraction and public health, this work addresses the lack of temporally aware benchmarks and metrics for time-series chart extraction. Both are needed to unlock outbreak data from published figures.
The paper introduces EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images, and EpiCurveSimilarity (ECS), a metric that tolerates temporal shifts and gaps while penalizing them proportionally. The best model achieves only 52.3% ECS. ECS also separates models more clearly than key-value metrics, spreading the four general-purpose VLMs over a 25-point range where RMS and SCRM compress them into a 5-point band, and it correlates 1.5 to 3.6 times more strongly with downstream epidemiological statistics than Dynamic Time Warping.
Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods (three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems), we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity; across all four, ECS correlates 1.5 to 3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application (unlocking decades of outbreak data trapped in published figures), but the benchmark and metric apply directly to any structured time-series chart-extraction setting.
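To make the idea of a gap-penalized alignment concrete, the following is a minimal sketch of a dynamic-programming similarity between a predicted and a ground-truth series. It is not the ECS definition from the paper, which the abstract does not spell out. The function names (point_similarity, alignment_similarity), the relative-error point score, the default gap_penalty of 0.25, and the normalization by the longer series are all assumptions chosen for illustration.

```python
from typing import Sequence


def point_similarity(a: float, b: float) -> float:
    """Illustrative per-point score in [0, 1]; 1.0 means identical values."""
    scale = max(abs(a), abs(b))
    if scale == 0:
        return 1.0  # both values are zero, so they agree exactly
    return max(0.0, 1.0 - abs(a - b) / scale)


def alignment_similarity(
    pred: Sequence[float],
    truth: Sequence[float],
    gap_penalty: float = 0.25,  # assumed value, not taken from the paper
) -> float:
    """Global alignment score in [0, 1] with a fixed cost per skipped point."""
    n, m = len(pred), len(truth)
    if max(n, m) == 0:
        return 1.0  # two empty series are trivially identical

    # best[i][j] = best total score for aligning pred[:i] with truth[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap_penalty

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                # pair pred[i-1] with truth[j-1]
                best[i - 1][j - 1] + point_similarity(pred[i - 1], truth[j - 1]),
                # skip a predicted point (gap on the truth side)
                best[i - 1][j] - gap_penalty,
                # skip a ground-truth point (gap on the prediction side)
                best[i][j - 1] - gap_penalty,
            )

    # Normalize by the longer series so missing points lower the score.
    return max(0.0, best[n][m] / max(n, m))


truth = [0, 1, 5, 10, 5, 1, 0]
shifted = [1, 5, 10, 5, 1, 0, 0]   # same curve, one step early
truncated = [0, 1, 5]              # stops before the peak

print(alignment_similarity(truth, truth))
print(alignment_similarity(shifted, truth))
print(alignment_similarity(truncated, truth))
```

In this sketch, the one-step shift costs only two gaps, so the shifted series still scores well, while the truncated series pays a gap for every missing ground-truth point. That is the distinction the abstract says Dynamic Time Warping cannot make, since it has no gap penalty.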