PolySet: Restoring the Statistical Ensemble Nature of Polymers for Machine Learning
This addresses a foundational limitation in polymer science for researchers using ML, though it is incremental as it builds on existing methods with a new encoding approach.
The authors tackled the mismatch between real polymers as statistical ensembles and their typical single-graph representation in machine learning by introducing PolySet, a framework that encodes polymers as weighted ensembles, which improved stability and accuracy in learning tail-sensitive properties.
Machine-learning (ML) models in polymer science typically treat a polymer as a single, perfectly defined molecular graph, even though real materials consist of stochastic ensembles of chains with distributed lengths. This mismatch between physical reality and digital representation limits the ability of current models to capture polymer behaviour. Here we introduce PolySet, a framework that represents a polymer as a finite, weighted ensemble of chains sampled from an assumed molar-mass distribution. This ensemble-based encoding is independent of chemical detail, compatible with any molecular representation and illustrated here in the homopolymer case using a minimal language model. We show that PolySet retains higher-order distributional moments (such as Mz, Mz+1), enabling ML models to learn tail-sensitive properties with greatly improved stability and accuracy. By explicitly acknowledging the statistical nature of polymer matter, PolySet establishes a physically grounded foundation for future polymer machine learning, naturally extensible to copolymers, block architectures, and other complex topologies.