CL, AI, SD, AS | Oct 23, 2024

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

arXiv:2410.17799v2 | 57 citations | h-index: 11 | ACL
Originality: Highly original
AI Analysis

This work aims to make voice conversation systems more human-like, with applications in AI assistants and human-computer interaction. It proposes a novel method for a known bottleneck: low-latency, natural interaction in full-duplex dialogue.

The paper tackles the challenge of achieving low latency and natural interactions in full-duplex spoken dialogue systems. It introduces OmniFlatten, an end-to-end GPT-based model that models complex conversation behaviors such as interruptions and overlapping speech, and that generates text and speech in real time.

Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).
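
The flattening operation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how streams are chunked or ordered, so everything below is an assumption made for illustration: the function name flatten_streams, the stream names, the chunk sizes, and the round-robin order are all hypothetical. The idea shown is only that several parallel token streams (for example user speech, assistant text, and assistant speech) are interleaved chunk by chunk into one flat sequence that a standard GPT backbone can consume without architectural changes.

```python
from typing import Dict, List, Sequence, Tuple


def flatten_streams(
    streams: Dict[str, Sequence[int]],
    chunk_sizes: Dict[str, int],
    order: List[str],
) -> List[Tuple[str, int]]:
    """Interleave fixed-size chunks of parallel token streams into one sequence.

    Hypothetical illustration only: stream names, chunk sizes, and ordering
    are assumptions, not values taken from the paper.
    """
    for name in order:
        if chunk_sizes[name] <= 0:
            raise ValueError(f"chunk size for {name!r} must be positive")

    flat: List[Tuple[str, int]] = []
    offsets = {name: 0 for name in order}

    # Keep emitting rounds until every stream has been fully consumed.
    while any(offsets[name] < len(streams[name]) for name in order):
        for name in order:
            start = offsets[name]
            end = start + chunk_sizes[name]
            # Slicing past the end yields a shorter (or empty) chunk.
            for token in streams[name][start:end]:
                flat.append((name, token))
            offsets[name] = end
    return flat


if __name__ == "__main__":
    example = flatten_streams(
        streams={
            "user_speech": [1, 2, 3, 4],
            "assistant_text": [10, 11],
            "assistant_speech": [20, 21, 22, 23],
        },
        chunk_sizes={"user_speech": 2, "assistant_text": 1, "assistant_speech": 2},
        order=["user_speech", "assistant_text", "assistant_speech"],
    )
    print(example)
```

Each token is tagged with its stream name here only to make the interleaving visible. A real system would more likely map every stream into a shared vocabulary with separator tokens, which is again an assumption rather than something the abstract states.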

Code Implementations: 1 repo
