Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

arXiv:2606.019093.3

AI Analysis

For audio processing, Echo demonstrates the feasibility of multi-task coexistence in a single encoder without per-task fine-tuning, though it is a proof-of-concept with no SOTA claims.

Echo is a proof-of-concept audio system using a single 25M-parameter ViT encoder that jointly handles speaker diarization, speech recognition, and source separation in a shared latent space, achieving 15.00% blind DER, 97.80% PIT separation accuracy, and +9.52 dB latent SI-SDR on synthetic VoxCeleb2 mixtures.

We present Echo, a proof-of-concept audio system built around a single 25M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised in stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Lightweight heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.
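For readers unfamiliar with the separation metrics quoted above, the following is a minimal sketch of the standard scale-invariant SDR (SI-SDR) and a brute-force permutation-invariant (PIT) assignment. It is not the authors' code: the function names `si_sdr` and `pit_assign` are hypothetical, the inputs are plain lists of floats rather than Echo's 512-dimensional latents, and how the paper defines "PIT separation accuracy" and "latent SI-SDR" in detail is not stated in the abstract.

```python
import math
from itertools import permutations


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sequences."""
    # Remove the mean so the metric ignores constant offsets.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: the optimally scaled target.
    dot = sum(a * b for a, b in zip(e, t))
    alpha = dot / (sum(b * b for b in t) + eps)
    s_target = [alpha * b for b in t]
    # Whatever is left after the projection counts as distortion.
    noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(num / den + eps)


def pit_assign(estimates, targets):
    """Try every estimate-to-target permutation and keep the best one.

    Returns (perm, score), where perm[i] is the index of the target
    matched to estimates[i] and score is the mean SI-SDR in dB.
    """
    best_perm, best_score = None, -math.inf
    for perm in permutations(range(len(targets))):
        score = sum(
            si_sdr(estimates[i], targets[j]) for i, j in enumerate(perm)
        ) / len(perm)
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Two orthogonal toy "sources" and estimates that arrive in swapped order.
src_a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
src_b = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
est_0 = [b + 0.1 * a for a, b in zip(src_a, src_b)]  # mostly src_b
est_1 = [a + 0.1 * b for a, b in zip(src_a, src_b)]  # mostly src_a
perm, score = pit_assign([est_0, est_1], [src_a, src_b])
```

The exhaustive search costs K! evaluations, which is fine for a handful of sources; for larger K, an assignment solver such as the Hungarian algorithm is the usual replacement.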
