Universal Spectral Tokenization via Self-Supervised Panchromatic Representation Learning
This work addresses fragmented spectral data analysis in astronomy by enabling information to be pooled across datasets, although the contribution is incremental: it extends established self-supervised methods to this domain.
The authors unify heterogeneous astronomical spectra across resolutions and domains with a self-supervised deep learning model that processes spectra directly on their native wavelength grids and produces aligned, physically meaningful representations. They show that this single model achieves competitive performance on a range of downstream tasks, suggesting its potential as a building block for foundation models in astronomy and other scientific domains.
Sequential scientific data span many resolutions and domains, and unifying them into a common representation is a key step toward developing foundation models for the sciences. Astronomical spectra exemplify this challenge: massive surveys have collected millions of spectra across a wide range of wavelengths and resolutions, yet analyses remain fragmented across spectral domains (e.g., optical vs. infrared) and object types (e.g., stars vs. galaxies), limiting the ability to pool information across datasets. We present a deep learning model that jointly learns from heterogeneous spectra in a self-supervised manner. Our universal spectral tokenizer processes spectra from a variety of object types and resolutions directly on their native wavelength grids, producing intrinsically aligned, homogeneous, and physically meaningful representations that can be efficiently adapted to achieve competitive performance across a range of downstream tasks. For the first time, we demonstrate that a single model can unify spectral data across resolutions and domains, suggesting that our model can serve as a powerful building block for foundation models in astronomy -- and potentially extend to other scientific domains with heterogeneous sequential data, such as climate and healthcare.
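To make the idea of tokenizing spectra "directly on their native wavelength grids" concrete, the following is a minimal illustrative sketch, not the authors' architecture, which the abstract does not specify. It assumes that each token is a fixed-size patch of flux values paired with a sinusoidal encoding of the patch's central log-wavelength, so that spectra sampled on different grids yield tokens of the same width without resampling. The function names, the patch size, the encoding dimension, and the median normalization are all hypothetical choices made for illustration.

```python
import numpy as np

def wavelength_encoding(wavelengths, dim=8):
    """Sinusoidal features of log-wavelength (an assumed, illustrative encoding)."""
    log_wl = np.log(np.asarray(wavelengths, dtype=float))
    freqs = 2.0 ** np.arange(dim // 2)            # geometric ladder of frequencies
    angles = log_wl[:, None] * freqs[None, :]     # shape (n, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def tokenize(wavelengths, flux, patch=4, dim=8):
    """Split a spectrum into patches on its own grid; no resampling is done.

    Returns an array of shape (n_tokens, patch + dim). `dim` must be even.
    """
    wavelengths = np.asarray(wavelengths, dtype=float)
    flux = np.asarray(flux, dtype=float)
    n = (len(flux) // patch) * patch              # drop trailing pixels that do not fill a patch
    scale = np.median(np.abs(flux[:n])) or 1.0    # hypothetical flux normalization
    fx = flux[:n].reshape(-1, patch) / scale
    centers = wavelengths[:n].reshape(-1, patch).mean(axis=1)
    return np.concatenate([fx, wavelength_encoding(centers, dim)], axis=1)

# Two synthetic spectra on different grids and resolutions (made-up values).
optical = tokenize(np.linspace(3800, 9200, 1000), np.ones(1000))
infrared = tokenize(np.linspace(15000, 17000, 8000), np.ones(8000))
print(optical.shape, infrared.shape)
```

Under these assumptions the two spectra produce different numbers of tokens but an identical token width, which is the property that would let a single sequence model consume both; a learned tokenizer would replace the hand-written patching and encoding used here.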