jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation

arXiv:2601.11719v2

h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient pre-training methods in high-energy physics, offering incremental improvements for jet data analysis tasks.

The paper addresses learning semantic jet representations from unlabeled data at the CERN Large Hadron Collider. Pre-training yields emergent class clustering in the representation space, which enables anomaly detection with simple distance-based metrics and, after fine-tuning, better classification performance than supervised models trained from scratch.

Self-supervised learning is a powerful pre-training method for learning feature representations without labels, which often capture generic underlying semantics from the data and can later be fine-tuned for downstream tasks. In this work, we introduce jBOT, a pre-training method based on self-distillation for jet data from the CERN Large Hadron Collider, which combines local particle-level distillation with global jet-level distillation to learn jet representations that support downstream tasks such as anomaly detection and classification. We observe that pre-training on unlabeled jets leads to emergent semantic class clustering in the representation space. The clustering in the frozen embedding, when pre-trained on background jets only, enables anomaly detection via simple distance-based metrics, and the learned embedding can be fine-tuned for classification with improved performance compared to supervised models trained from scratch.
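The distance-based anomaly detection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distance metric is used, so the k-nearest-neighbor Euclidean score below is an assumption. The function name `knn_anomaly_score`, the parameter `k`, and the synthetic Gaussian "embeddings" are hypothetical stand-ins for frozen jBOT embeddings.

```python
import numpy as np

def knn_anomaly_score(background, queries, k=5):
    """Mean Euclidean distance from each query to its k nearest background points.

    Assumption: a kNN distance stands in for the paper's unspecified
    "simple distance-based metrics".
    """
    # Pairwise distances, shape (n_queries, n_background)
    diff = queries[:, None, :] - background[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    # Select the k smallest distances in each row (order within them is irrelevant)
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Synthetic stand-ins for frozen embeddings (hypothetical data, not jet data)
rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, size=(500, 8))      # background-only reference set
normal_jets = rng.normal(0.0, 1.0, size=(50, 8))      # drawn from the same distribution
anomalous_jets = rng.normal(6.0, 1.0, size=(50, 8))   # shifted away from the background

normal_scores = knn_anomaly_score(background, normal_jets)
anomalous_scores = knn_anomaly_score(background, anomalous_jets)
```

In this toy setup, points far from the background cluster receive larger scores, so thresholding the score separates the two groups. How well this carries over to real jets depends on how cleanly the pre-trained embedding clusters the classes.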

