Representing Speech Through Autoregressive Prediction of Cochlear Tokens
This work addresses speech processing for AI systems by proposing a biologically inspired model, though it appears incremental because it builds on existing auditory-processing and autoregressive methods.
The paper tackles speech representation by introducing AuriStream, a two-stage model that transforms audio into cochlear tokens and applies an autoregressive sequence model over them. It achieves competitive performance on SUPERB speech tasks and can generate audio continuations.
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete \textbf{cochlear tokens}. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art lexical semantics. It also shows competitive performance on diverse downstream SUPERB speech tasks. Complementing these representational capabilities, AuriStream generates audio continuations that can be visualized in spectrogram space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
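The two-stage structure described above can be illustrated with a toy sketch. This is not the AuriStream implementation: every component here is a simplified stand-in chosen for illustration. The hypothetical `toy_cochleagram` pools STFT power into log-spaced bands instead of using a real cochlear model, `tokenize` uses a nearest-neighbor codebook drawn from the data instead of a learned quantizer, and a smoothed bigram table replaces the autoregressive sequence model. All function names and parameters are assumptions.

```python
import numpy as np

def toy_cochleagram(audio, n_fft=256, hop=128, n_bands=16):
    """Stage 1a (stand-in): STFT power pooled into log-spaced bands."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Log-spaced band edges; duplicates are dropped, so there may be
    # fewer than n_bands bands.
    edges = np.unique(np.round(np.geomspace(1, n_fft // 2, n_bands + 1)).astype(int))
    bands = np.stack([power[:, lo:hi].sum(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
    return np.log1p(bands)  # log compression

def tokenize(features, codebook):
    """Stage 1b (stand-in): index of the nearest codebook vector per frame."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def fit_bigram(tokens, vocab_size):
    """Stage 2 (stand-in): add-one-smoothed next-token probabilities."""
    counts = np.ones((vocab_size, vocab_size))
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1)
    return counts / counts.sum(axis=1, keepdims=True)

def continue_tokens(prompt, probs, n_steps, rng):
    """Sample a continuation one token at a time, conditioning on the last token."""
    out = list(prompt)
    for _ in range(n_steps):
        out.append(rng.choice(probs.shape[1], p=probs[out[-1]]))
    return np.array(out)

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * (200 + 600 * t) * t)  # 1 s synthetic chirp, not speech

feats = toy_cochleagram(audio)
vocab_size = 8
codebook = feats[rng.choice(len(feats), size=vocab_size, replace=False)]
tokens = tokenize(feats, codebook)
probs = fit_bigram(tokens, vocab_size)
continuation = continue_tokens(tokens[:10], probs, n_steps=20, rng=rng)

# Map predicted tokens back to band space so the prediction can be inspected.
predicted_bands = codebook[continuation]
```

Mapping the sampled tokens back through the codebook gives a time-frequency array that can be plotted, which loosely mirrors how the abstract describes inspecting continuations in spectrogram space. Decoding back to a waveform is not attempted in this sketch.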