Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
This work targets real-time voice assistants for users who need seamless interaction across speech and text. It offers a novel method for a known bottleneck rather than a foundational shift.
The paper tackles the integration of speech and text modalities by introducing Ichigo, a mixed-modal model that processes interleaved sequences of speech and text using a tokenized early-fusion approach. Ichigo outperforms existing open-source speech language models on speech question-answering benchmarks, achieves results comparable to cascaded systems, and generates its first token with a latency of 111 ms.
Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
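The tokenized early-fusion idea described in the abstract can be illustrated with a minimal sketch. It is not Ichigo's implementation: the vocabulary size, codebook, nearest-neighbour quantizer, and the `SOUND_START`/`SOUND_END` marker ids are all hypothetical values chosen for illustration. The sketch only shows how quantized speech codes can share one id space with text tokens, so that a single transformer could consume the interleaved sequence without a separate adapter.

```python
from typing import List, Sequence, Tuple

# All constants below are hypothetical, chosen only to keep the example small.
TEXT_VOCAB_SIZE = 8                                # ids 0..7 stand for text tokens
NUM_SOUND_CODES = 4                                # size of the toy speech codebook
SOUND_START = TEXT_VOCAB_SIZE + NUM_SOUND_CODES    # marker id opening a speech span
SOUND_END = SOUND_START + 1                        # marker id closing a speech span

Frame = Tuple[float, ...]


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Map each feature frame to the index of its nearest codebook vector."""
    def sq_dist(a: Frame, b: Frame) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def speech_to_token_ids(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Shift speech codes past the text vocabulary and wrap them in marker ids."""
    codes = quantize(frames, codebook)
    return [SOUND_START] + [TEXT_VOCAB_SIZE + c for c in codes] + [SOUND_END]


def build_sequence(segments, codebook: Sequence[Frame]) -> List[int]:
    """Flatten ("text", ids) and ("speech", frames) segments into one id sequence."""
    sequence: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(payload)
        elif kind == "speech":
            sequence.extend(speech_to_token_ids(payload, codebook))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Toy 2-D codebook and an interleaved text / speech / text input.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
example = build_sequence(
    [("text", [1, 2]), ("speech", [(0.1, 0.9), (0.9, 0.2)]), ("text", [3])],
    CODEBOOK,
)
print(example)  # [1, 2, 12, 10, 9, 13, 3]
```

In this toy setup the two speech frames map to codes 2 and 1, which become ids 10 and 9 after the offset, so text and speech ids never collide. A real system would replace `quantize` with a learned speech quantizer and extend the model's embedding table to cover the added ids.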