LG · MTRL-SCI · Feb 28, 2025

Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation

arXiv:2503.00152v1 · 20 citations · h-index: 12 · NIPS
Originality: Incremental advance
AI Analysis

This work addresses the challenge of invariant representation for crystal structures in materials science, enabling more reliable language-model-based generation. It appears incremental because it builds on existing language model approaches.

The paper tackles the problem of generating crystal materials using language models by proposing Mat2Seq, a method that converts 3D crystal structures into unique 1D sequences with SE(3) and periodic invariance, achieving promising performance compared to prior methods.

We consider the problem of crystal materials generation using language models (LMs). A key step is to convert 3D crystal structures into 1D sequences to be processed by LMs. Prior studies used the crystallographic information framework (CIF) file stream, which fails to ensure SE(3) and periodic invariance and may not lead to unique sequence representations for a given crystal structure. Here, we propose a novel method, known as Mat2Seq, to tackle this challenge. Mat2Seq converts 3D crystal structures into 1D sequences and ensures that different mathematical descriptions of the same crystal are represented in a single unique sequence, thereby provably achieving SE(3) and periodic invariance. Experimental results show that, with language models, Mat2Seq achieves promising performance in crystal structure generation as compared with prior methods.
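The non-uniqueness problem described in the abstract can be illustrated with a small sketch. This is not the Mat2Seq algorithm; it only shows, for a hypothetical CsCl-like toy cell, that a naive text serialization of raw lattice vectors and fractional coordinates changes when the same crystal is rotated or when the cell origin is shifted, whereas lattice lengths and angles are unaffected by rotation. The function names, the toy cell, and the serialization format are all assumptions made for illustration.

```python
import math

def rotate_z(v, theta):
    """Rotate a 3D vector about the z-axis by theta radians."""
    x, y, z = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def lattice_parameters(lattice):
    """Return (a, b, c, alpha, beta, gamma), with angles in degrees."""
    def norm(u):
        return math.sqrt(sum(x * x for x in u))

    def angle(u, v):
        cos_uv = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        # Clamp to [-1, 1] to guard against floating-point round-off.
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_uv))))

    a, b, c = lattice
    return (norm(a), norm(b), norm(c), angle(b, c), angle(a, c), angle(a, b))

def naive_sequence(lattice, atoms):
    """Serialize raw lattice vectors and fractional coordinates (hypothetical format)."""
    parts = [" ".join(f"{x:.3f}" for x in v) for v in lattice]
    parts += [f"{el} " + " ".join(f"{x:.3f}" for x in frac) for el, frac in atoms]
    return " | ".join(parts)

# Hypothetical toy cell: cubic, two atoms, CsCl-like.
lattice = ((4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0))
atoms = [("Cs", (0.0, 0.0, 0.0)), ("Cl", (0.5, 0.5, 0.5))]

# Same crystal, rigidly rotated by 30 degrees about z.
rotated = tuple(rotate_z(v, math.radians(30)) for v in lattice)

# Same crystal, with the unit-cell origin shifted by a quarter cell.
shifted = [(el, tuple((x + 0.25) % 1.0 for x in frac)) for el, frac in atoms]

# Lengths and angles survive the rotation; the raw vectors do not.
same_params = all(
    math.isclose(p, q, abs_tol=1e-9)
    for p, q in zip(lattice_parameters(lattice), lattice_parameters(rotated))
)

print(same_params)
print(naive_sequence(lattice, atoms) == naive_sequence(rotated, atoms))
print(naive_sequence(lattice, atoms) == naive_sequence(lattice, shifted))
```

Using lengths and angles removes the dependence on orientation but not on the choice of origin or unit cell, so it addresses only part of the invariance the abstract describes.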

