CLSDASApr 16, 2024

Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

arXiv:2404.10922v131 citationsh-index: 11NAACL-HLT
Originality Highly original
AI Analysis

This work addresses the problem of limited LLM application in speech for multilingual contexts, representing a novel method for bridging text and speech modalities.

The paper tackles the challenge of applying multilingual large language models to speech by integrating a multilingual speech encoder with a multilingual LLM using multi-instructional training, achieving robust zero-shot performance across tasks like speech translation and multilingual spoken language understanding on data from 139 languages.

Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes