CL | AS | Nov 4, 2024

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

arXiv:2411.01834v2 | 33 citations | h-index: 21 | ACL
Originality: Incremental advance
AI Analysis

This work aims to improve semantic understanding in speech-to-speech models, with applications in natural language processing and speech technology. It is an incremental advance in spoken language model (SLM) methods.

The paper addresses the gap in semantic coherence between textless Spoken Language Models (SLMs) and text-based models. It introduces the Align-SLM framework, which applies preference optimization based on reinforcement learning from AI feedback, and reports state-of-the-art SLM performance on most of the evaluated benchmarks, which include ZeroSpeech 2021 and StoryCloze.

While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning from AI Feedback (RLAIF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation. Experimental results show that our method achieves state-of-the-art performance for SLMs on most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
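
The pipeline described in the abstract (sample several continuations, score them with a semantic metric, keep a preferred and a dispreferred one, then optimize with DPO) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `PreferencePair`, `build_preference_pair`, `score_fn`, and `min_gap` are hypothetical, plain strings stand in for the speech continuations, and the best-versus-worst selection rule is an assumption. Only the `dpo_loss` formula follows the standard per-example DPO objective.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PreferencePair:
    """One DPO training example: a prompt with a preferred and a dispreferred continuation."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    continuations: Sequence[str],
    score_fn: Callable[[str], float],  # hypothetical semantic metric, higher is better
    min_gap: float = 0.0,              # hypothetical threshold on the score difference
) -> Optional[PreferencePair]:
    """Pick the best- and worst-scored continuations as (chosen, rejected).

    Returns None when fewer than two continuations are given or when the
    score gap is not larger than min_gap (the pair is too ambiguous to use).
    """
    if len(continuations) < 2:
        return None
    ranked = sorted(continuations, key=score_fn)
    worst, best = ranked[0], ranked[-1]
    if score_fn(best) - score_fn(worst) <= min_gap:
        return None
    return PreferencePair(prompt=prompt, chosen=best, rejected=worst)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard per-example DPO loss: -log(sigmoid(margin)).

    margin = beta * [(log pi(chosen) - log pi_ref(chosen))
                     - (log pi(rejected) - log pi_ref(rejected))]
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, the loss equals log 2 when the policy and the reference model assign identical log-probabilities, and it falls below log 2 as the policy raises the chosen continuation's log-probability relative to the rejected one.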
