SD, AI, LG, AS | Dec 8, 2025

JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

arXiv:2512.07168v1 | 1 citation | h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses speech representation learning for audio processing, offering incremental improvements in tokenization efficiency.

The paper tackles the problem of learning robust speech representations with a two-stage self-supervised framework that combines JEPA with DAAM. The resulting tokens, produced at 47.5 tokens/sec, are competitive in efficiency with existing neural audio codecs.

We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
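
To make the "FSQ and mixed-radix packing" step concrete, here is a minimal sketch of the general idea, not the authors' implementation. Each bounded latent dimension is rounded to one of a few integer levels, and the per-dimension codes are then packed into a single integer token that can be unpacked without loss. The level counts in `LEVELS` and the function names `fsq_quantize`, `pack`, and `unpack` are illustrative assumptions; the abstract does not state the paper's actual configuration.

```python
# Hypothetical per-dimension level counts (assumption, not from the paper).
LEVELS = [8, 5, 5, 5]


def fsq_quantize(z, levels=LEVELS):
    """Round each value z[i], clipped to [-1, 1], to an integer code in [0, levels[i] - 1]."""
    codes = []
    for value, n in zip(z, levels):
        clipped = max(-1.0, min(1.0, value))
        # Rescale [-1, 1] onto [0, n - 1], then round to the nearest level.
        codes.append(round((clipped + 1.0) / 2.0 * (n - 1)))
    return codes


def pack(codes, levels=LEVELS):
    """Combine per-dimension codes into one integer using a mixed-radix number system."""
    token = 0
    for code, n in zip(codes, levels):
        if not 0 <= code < n:
            raise ValueError(f"code {code} out of range for radix {n}")
        token = token * n + code
    return token


def unpack(token, levels=LEVELS):
    """Invert pack(): recover the per-dimension codes from a single integer token."""
    codes = []
    for n in reversed(levels):
        token, code = divmod(token, n)
        codes.append(code)
    return codes[::-1]


codes = fsq_quantize([-1.0, 0.0, 0.5, 1.0])
token = pack(codes)
assert unpack(token) == codes  # packing is reversible

# Rates quoted in the abstract: 47.5 tokens/sec at a 2.5 Hz frame rate.
tokens_per_frame = 47.5 / 2.5
```

With these assumed levels, every token lies in the range 0 to 999 (the product of the radices minus one), which is what makes the packed stream usable as a small discrete vocabulary. The rates quoted in the abstract imply 19 tokens per 2.5 Hz frame; how those tokens are arranged within a frame is not specified here.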
