HEP-PH, LG, Sep 19, 2024

Is Tokenization Needed for Masked Particle Modelling?

arXiv:2409.12589v2, 23 citations, h-index: 88
Originality: Incremental advance
AI Analysis

This work incrementally improves foundation models for high-energy physics jets, benefiting researchers in particle physics with better classification, vertex finding, and track identification.

The paper addresses inefficiencies in masked particle modeling (MPM), a self-supervised learning scheme for unordered sets in high-energy physics. It improves on earlier MPM results by fixing implementation inefficiencies and using a more powerful decoder. It also introduces reconstruction methods based on conditional generative models, which need no tokenization or discretization and outperform the original tokenized MPM objective on jet physics tasks.

In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that utilize conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets, which includes using a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
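The masked-set objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-component feature vectors, the function names, the 40% mask fraction, and the mean-of-visible "model" are all assumptions made for illustration, and a plain squared-error loss stands in for the tokenized and conditional generative objectives that the paper actually compares.

```python
import random

def mask_set(particles, mask_fraction=0.4, seed=0):
    """Split a set of particle feature vectors into visible and masked parts.

    Assumes at least two particles so that one can stay visible.
    """
    rng = random.Random(seed)
    n = len(particles)
    n_masked = min(n - 1, max(1, int(round(mask_fraction * n))))
    masked_idx = set(rng.sample(range(n), n_masked))
    visible = [p for i, p in enumerate(particles) if i not in masked_idx]
    targets = [p for i, p in enumerate(particles) if i in masked_idx]
    return visible, targets

def mean_predictor(visible, n_masked):
    """Hypothetical stand-in for a trained model.

    Predicts the per-feature mean of the visible particles for every
    masked slot; a real model would be a learned encoder and decoder.
    """
    dim = len(visible[0])
    mean = [sum(p[d] for p in visible) / len(visible) for d in range(dim)]
    return [list(mean) for _ in range(n_masked)]

def masked_mse(predictions, targets):
    """Mean squared error computed over the masked elements only."""
    total, count = 0.0, 0
    for pred, tgt in zip(predictions, targets):
        for a, b in zip(pred, tgt):
            total += (a - b) ** 2
            count += 1
    return total / count

# Toy "jet": five particles with three illustrative features each.
jet = [
    [1.0, 0.1, 0.2],
    [2.0, -0.3, 0.4],
    [0.5, 0.0, -0.1],
    [1.5, 0.2, 0.3],
    [0.8, -0.1, 0.0],
]

visible, targets = mask_set(jet, mask_fraction=0.4, seed=0)
predictions = mean_predictor(visible, len(targets))
loss = masked_mse(predictions, targets)
print(f"visible={len(visible)} masked={len(targets)} loss={loss:.4f}")
```

The point of the sketch is that the loss uses only the set itself: the masked particles serve as their own targets, so no labels are needed, which is why the objective can be applied directly to experimental data.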
