CL | SD | AS | Aug 15, 2025

Representing Speech Through Autoregressive Prediction of Cochlear Tokens

arXiv:2508.11598v1 | h-index: 64 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses speech processing for AI systems by proposing a biologically inspired model, though it appears incremental as it builds on existing auditory processing and autoregressive methods.

The paper tackles speech representation by introducing AuriStream, a two-stage model that transforms audio into cochlear tokens and applies an autoregressive sequence model over them. It achieves competitive performance on SUPERB speech tasks and can generate audio continuations.

We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, as well as state-of-the-art lexical semantics. It also shows competitive performance on diverse downstream SUPERB speech tasks. Complementing these representational capabilities, AuriStream generates continuations of audio that can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
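The two-stage shape described in the abstract (audio to a time-frequency representation, then discrete tokens, then next-token prediction) can be illustrated with a toy sketch. This is not AuriStream's implementation: the abstract does not specify the cochlear model, the tokenizer, or the sequence architecture, so every piece here is a stand-in chosen for brevity. The function names (`band_energies`, `tokenize`, `fit_bigram`, `continue_tokens`, `make_tones`) are hypothetical, the "cochlear" stage is a handful of single-frequency energy probes, the tokenizer picks the loudest band per frame, and the autoregressive model is a bigram count table.

```python
import math
from collections import Counter, defaultdict


def band_energies(signal, sample_rate, frame_len, centers_hz):
    """Stage 1a (stand-in): per-frame log energy at a few center frequencies."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        row = []
        for f in centers_hz:
            w = 2 * math.pi * f / sample_rate
            re = sum(x * math.cos(w * n) for n, x in enumerate(frame))
            im = sum(x * math.sin(w * n) for n, x in enumerate(frame))
            row.append(math.log1p(re * re + im * im))
        frames.append(row)
    return frames


def tokenize(frames):
    """Stage 1b (stand-in): one token per frame, the index of the loudest band."""
    return [max(range(len(row)), key=row.__getitem__) for row in frames]


def fit_bigram(tokens):
    """Stage 2 (stand-in): count-based model of the next token given the last one."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_tokens(counts, prompt, n_new):
    """Greedy autoregressive continuation of a token prompt."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # unseen context: nothing to predict
        out.append(followers.most_common(1)[0][0])
    return out


def make_tones(freqs_hz, sample_rate, frame_len):
    """Synthetic test signal: one pure tone per frame (an assumption, not speech)."""
    signal = []
    for f in freqs_hz:
        w = 2 * math.pi * f / sample_rate
        signal.extend(math.sin(w * n) for n in range(frame_len))
    return signal


# Toy usage: alternating 200 Hz and 400 Hz frames at 8 kHz, 80 samples per frame.
signal = make_tones([200, 400] * 4, sample_rate=8000, frame_len=80)
frames = band_energies(signal, 8000, 80, centers_hz=[200, 400, 800])
tokens = tokenize(frames)
model = fit_bigram(tokens)
continuation = continue_tokens(model, prompt=tokens[:1], n_new=4)
```

In this toy setup the token sequence simply alternates between the two bands, so the bigram model learns to continue that alternation. A real system would replace each stand-in with a learned or physiologically motivated component and would decode predicted tokens back to a spectrogram-like space, as the abstract describes.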

