Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach
The work addresses the data inefficiency of speech-only systems for natural language processing, though the contribution is an incremental improvement.
The paper tackles spoken language modeling by fine-tuning speech representation models on phoneme classification. Language models trained on the resulting units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and language models trained on these units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
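
To make the idea of a phoneme classification objective concrete, the following is a minimal sketch of one possible form: a per-frame cross-entropy over phoneme logits, plus an argmax step that turns frame logits into discrete units. The abstract does not state the loss, the phoneme inventory, or how units are extracted, so the frame-level formulation and the function names `frame_cross_entropy` and `predict_units` are assumptions made for illustration only, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's phoneme logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_cross_entropy(frame_logits, frame_labels):
    """Mean negative log-likelihood of the reference phoneme per frame.

    Assumption: one phoneme label per frame (a frame-level objective).
    The source does not specify the training loss.
    """
    if len(frame_logits) != len(frame_labels):
        raise ValueError("need exactly one label per frame")
    losses = []
    for logits, label in zip(frame_logits, frame_labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


def predict_units(frame_logits):
    """Pick the highest-scoring phoneme index for each frame."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in frame_logits]
```

In a real system the logits would come from a classification head on top of the speech representation model, and minimizing this loss would update the model's weights. Here they are plain lists of floats so that the objective can be read in isolation.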