UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation
This work addresses a data bottleneck in deep learning-based microstructural characterization for materials science, though the contribution is incremental in that it builds on existing segmentation and generation techniques.
The authors tackle the scarcity of large-scale, diverse, expert-annotated electron micrograph datasets by introducing UniEM-3M, a dataset of 5,091 high-resolution images with about 3 million instance segmentation labels. They also show that their baseline model, UniEM-Net, outperforms the other instance segmentation methods evaluated on this benchmark.
Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, owing to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions; a subset of the dataset will be made publicly available. Furthermore, we release a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark (available on Hugging Face) will significantly accelerate progress in automated materials analysis.