CV MTRL-SCI CL LG

Jun 16, 2025

Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning

Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban

arXiv:2506.13051v1

Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for robust evaluation of foundation models in materials science, providing a domain-specific benchmark for crystallographic reasoning.

The authors evaluate multimodal foundation models on crystallographic reasoning by introducing a multiscale multicrystal dataset with two physically grounded benchmarks, Spatial-Exclusion and Compositional-Exclusion, which test generalization, consistency, and reliability. The result is a reproducible framework whose metrics include relative errors and a physics-consistency index.

Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision-language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
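The two exclusion protocols and the relative-error metric described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Sample` record, its field names, the function names, and the numeric values are all hypothetical, chosen only to show the idea of holding out one radius or one composition and scoring a predicted lattice parameter against a reference.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one supercell (field names are assumptions)."""
    composition: str   # chemical formula label, e.g. "Au"
    radius: float      # supercell radius, in the dataset's own unit
    lattice_a: float   # reference lattice parameter


def spatial_exclusion_split(samples, held_out_radius):
    """Withhold every supercell whose radius matches held_out_radius."""
    held_out = [s for s in samples if math.isclose(s.radius, held_out_radius)]
    retained = [s for s in samples if not math.isclose(s.radius, held_out_radius)]
    return retained, held_out


def compositional_exclusion_split(samples, held_out_composition):
    """Withhold every sample of one chemical composition."""
    held_out = [s for s in samples if s.composition == held_out_composition]
    retained = [s for s in samples if s.composition != held_out_composition]
    return retained, held_out


def relative_error(predicted, reference):
    """Relative error |predicted - reference| / |reference|."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(predicted - reference) / abs(reference)


# Placeholder data, not taken from the paper's dataset.
samples = [
    Sample("Au", 7.0, 4.0),
    Sample("Au", 9.0, 4.0),
    Sample("Ag", 7.0, 4.0),
]

retained, held_out = spatial_exclusion_split(samples, held_out_radius=7.0)
print(len(retained), "retained,", len(held_out), "held out by radius")

retained_c, held_out_c = compositional_exclusion_split(samples, "Ag")
print(len(retained_c), "retained,", len(held_out_c), "held out by composition")

# Score a hypothetical model prediction for the lattice parameter.
print(relative_error(predicted=4.2, reference=held_out[0].lattice_a))
```

The physics-consistency index and hallucination score are not sketched here, since the abstract does not give their definitions; the linked repository is the place to look for them.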
