Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images
This work addresses a critical task in high-energy physics, one that matters for new physics searches at the Large Hadron Collider, and establishes a systematic framework for applying ViTs to jet classification. The contribution is incremental, however, since it builds on existing deep learning methods.
The paper tackles quark-gluon jet classification from calorimeter images using Vision Transformers (ViTs) and ViT-CNN hybrid models. It shows that ViT-based approaches outperform CNN baselines in F1-score, ROC-AUC, and accuracy on simulated CMS Open Data.
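The three metrics named above can be computed as in the following minimal NumPy sketch. It is an illustration, not the paper's evaluation code: the label convention (quark = 1, gluon = 0), the 0.5 decision threshold, and the toy scores are assumptions chosen for the example. ROC-AUC is computed as the probability that a randomly chosen positive jet receives a higher score than a randomly chosen negative one.

```python
import numpy as np

# Assumed label convention (not taken from the paper): quark = 1, gluon = 0.

def accuracy(y_true, y_pred):
    """Fraction of jets whose predicted label matches the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class: 2TP / (2TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve: P(score of a positive > score of a negative), ties count 1/2.

    Uses an O(n_pos * n_neg) pairwise comparison, so it is only meant for small arrays.
    """
    is_pos = np.asarray(y_true) == positive
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[is_pos], scores[~is_pos]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative jet")
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Toy example: six jets with classifier scores, thresholded at 0.5 (an arbitrary choice).
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.5, 0.1])
y_pred = (scores >= 0.5).astype(int)
print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(f1_score(y_true, y_pred), 3))  # 0.667
print(round(roc_auc(y_true, scores), 3))   # 0.889
```

ROC-AUC is threshold-free, whereas accuracy and F1 both depend on the chosen cut on the classifier output. On full test sets a library routine such as scikit-learn's `roc_auc_score` would normally replace the pairwise version shown here.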
Distinguishing between quark- and gluon-initiated jets is a challenging task in high-energy physics and is pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, which model global context through self-attention, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling end-to-end learning directly from detector-level inputs. Our comprehensive benchmarking demonstrates that ViT-based models, notably the ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, and provides a structured dataset suitable for further deep learning research in this domain.
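The multi-channel image construction and the ViT input step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's pipeline: each subdetector is assumed to supply per-hit (eta, phi, weight) arrays, and the 32x32 grid, the +/-0.4 window around the jet axis, the channel names, and the 8x8 patch size are all hypothetical choices. `jet_image` stacks one energy-weighted histogram per subdetector into a (C, H, W) array. `patchify` then performs the standard ViT tokenization, flattening non-overlapping patches into vectors so that self-attention can relate distant regions of the jet.

```python
import numpy as np

def delta_phi(phi, phi0):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (np.asarray(phi, dtype=float) - phi0 + np.pi) % (2 * np.pi) - np.pi

def jet_image(deposits, jet_eta, jet_phi, n_pix=32, half_width=0.4):
    """Bin per-subdetector deposits into a (C, n_pix, n_pix) image centred on the jet axis.

    deposits maps a channel name to (eta, phi, weight) arrays (hypothetical layout); the
    weight is energy for calorimeter cells and, e.g., track pT for tracks. Channel order
    follows dict order. Deposits outside the +/- half_width window are dropped.
    """
    edges = np.linspace(-half_width, half_width, n_pix + 1)
    channels = []
    for eta, phi, weight in deposits.values():
        deta = np.asarray(eta, dtype=float) - jet_eta
        dphi = delta_phi(phi, jet_phi)
        img, _, _ = np.histogram2d(deta, dphi, bins=[edges, edges],
                                   weights=np.asarray(weight, dtype=float))
        channels.append(img)
    return np.stack(channels).astype(np.float32)

def patchify(image, patch=8):
    """Split a (C, H, W) image into ViT tokens of shape (num_patches, patch * patch * C)."""
    c, h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 2, 4, 0)  # (patch rows, patch cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

# Toy jet at (eta, phi) = (0, 0) with hypothetical ECAL, HCAL and track inputs.
deposits = {
    "ecal": ([0.05, 0.9], [0.0, 0.0], [10.0, 50.0]),  # second deposit lies outside the window
    "hcal": ([0.0, 0.1], [-0.1, 0.2], [3.0, 2.0]),
    "tracks": ([-0.2], [0.35], [1.0]),
}
img = jet_image(deposits, jet_eta=0.0, jet_phi=0.0)
tokens = patchify(img, patch=8)
print(img.shape, img.sum(axis=(1, 2)).tolist())  # (3, 32, 32) [10.0, 5.0, 1.0]
print(tokens.shape)                                # (16, 192)
```

In a full ViT each token would next be linearly projected and combined with a position embedding before the self-attention layers. A hybrid model may instead derive its tokens from CNN feature maps; this abstract does not say how the ViT+MaxViT and ViT+ConvNeXt hybrids combine the two.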