CL, AS · Feb 17

ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling

Nicol Visser, Simon Malan, Danel Slabbert, Herman Kamper

arXiv:2602.15537v11.13 citations · h-index: 18

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient spoken language modeling for speech processing applications, though the advance is incremental because it builds on existing self-supervised encoders.

The paper tackles the problem of long token sequences in pure speech language models by proposing ZeroSyl, a training-free method for syllable tokenization using a frozen WavLM model, which outperforms prior syllabic tokenizers on lexical, syntactic, and narrative benchmarks.

Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
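
The three steps named in the abstract (norm-based boundaries, mean pooling, K-means discretization) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the input frames stand in for WavLM intermediate-layer features, the rule of placing a boundary at each local minimum of the L2-norm curve is an assumption made for illustration, and all function names are hypothetical.

```python
import math
import random

def l2_norms(frames):
    """Per-frame L2 norm of a sequence of feature vectors."""
    return [math.sqrt(sum(x * x for x in f)) for f in frames]

def find_boundaries(norms):
    """Assumed rule: a boundary at every strict local minimum of the norm curve."""
    inner = [t for t in range(1, len(norms) - 1)
             if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]]
    return [0] + inner + [len(norms)]

def mean_pool(frames, boundaries):
    """Average the frames inside each [start, end) segment."""
    pooled = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        seg = frames[start:end]
        pooled.append([sum(col) / len(seg) for col in zip(*seg)])
    return pooled

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def tokenize(frames, centroids):
    """Frames -> segment embeddings -> discrete unit IDs (one per segment)."""
    segments = mean_pool(frames, find_boundaries(l2_norms(frames)))
    return [nearest(s, centroids) for s in segments]

# Toy input: six 2-dimensional "frames" (placeholders for real encoder features).
frames = [[3, 0], [1, 0], [3, 0], [0, 2], [0, 0.5], [0, 2]]
segments = mean_pool(frames, find_boundaries(l2_norms(frames)))
centroids = kmeans(segments, k=2)
tokens = tokenize(frames, centroids)
```

In this toy input the six frames collapse into three segment-level tokens, which shows the sequence-length reduction that motivates syllable-like units. In practice the frames would come from a frozen encoder and the codebook would be fitted on segments pooled over a large corpus.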
