CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
This benchmark identifies a new, challenging task for vision-language models in materials science, revealing systematic failures in quantitative scientific figure understanding.
The authors introduce CrystalXRD-Bench, a 250-sample benchmark for evaluating vision-language models on XRD peak indexing. The best model (GPT-5.4) achieves a Jaccard score of 0.5888 and exact-match rate of 37.6%, with most models scoring below 0.50, indicating the task is far from solved.
Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.