Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning
This work addresses audio tasks spanning speech, music, and environmental sounds, but the contribution is incremental: it adapts an existing framework (JEPA) to audio data.
The authors tackle audio representation learning by proposing Audio-JEPA, a self-supervised model that predicts latent representations of masked spectrogram patches. It achieves performance comparable to wav2vec 2.0 and data2vec while using less than one-fifth of their training data.
Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in a high-level feature space, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically to audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10 s, 32 kHz) with random patch masking on mel-spectrograms. We evaluate on the X-ARES suite, which covers speech, music, and environmental sound tasks. Although our implementation is a straightforward translation of the original model to audio, it achieves performance comparable to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no hyper-parameter tuning. All code and pretrained checkpoints will be released on GitHub.
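As a rough illustration of the random patch masking step, the sketch below splits the patch grid of one mel-spectrogram into visible (context) and masked (target) patch indices. Only the clip length (10 s) and sample rate (32 kHz) come from the abstract. The hop size, number of mel bins, patch size, and mask ratio are assumptions chosen for illustration, and the helper names `patch_grid` and `random_patch_mask` are hypothetical, not taken from the released code.

```python
import random

def patch_grid(n_mels, n_frames, patch):
    """Count non-overlapping square patches along (frequency, time).

    Frames or bins that do not fill a whole patch are dropped.
    """
    return n_mels // patch, n_frames // patch

def random_patch_mask(n_patches, mask_ratio, seed=None):
    """Split patch indices into visible (context) and masked (target) sets."""
    rng = random.Random(seed)
    n_masked = int(round(n_patches * mask_ratio))
    masked = sorted(rng.sample(range(n_patches), n_masked))
    masked_set = set(masked)
    context = [i for i in range(n_patches) if i not in masked_set]
    return context, masked

# From the abstract: 10 s clips sampled at 32 kHz.
SAMPLE_RATE = 32_000
CLIP_SECONDS = 10

# Assumed values, for illustration only (not stated in the abstract).
HOP = 320         # samples between spectrogram frames (10 ms at 32 kHz)
N_MELS = 128      # mel frequency bins
PATCH = 16        # square patch side, in bins/frames
MASK_RATIO = 0.5  # fraction of patches hidden from the context encoder

n_frames = SAMPLE_RATE * CLIP_SECONDS // HOP
rows, cols = patch_grid(N_MELS, n_frames, PATCH)
context_idx, target_idx = random_patch_mask(rows * cols, MASK_RATIO, seed=0)

print(f"{n_frames} frames -> {rows} x {cols} patches")
print(f"{len(context_idx)} context patches, {len(target_idx)} target patches")
```

In a JEPA-style setup, the context patches would be fed to the encoder, and a predictor would be trained to match the latent representations of the target patches rather than their raw spectrogram values.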