CL SD AS | Feb 26, 2025

When Large Language Models Meet Speech: A Survey on Integration Approaches

Zhengdong Yang, Shuichiro Shimizu, Yahan Yu, Chenhui Chu

arXiv:2502.19548v2 | 9 citations | h-index: 5 | ACL

Originality: Synthesis-oriented

AI Analysis

The survey gives researchers and practitioners working on multimodal AI a comprehensive overview, but its contribution is incremental: it synthesizes existing studies without introducing new methods.

This survey categorizes and reviews methodologies for integrating speech with large language models, covering text-based, latent-representation-based, and audio-token-based approaches, and highlights applications and challenges in the field.

Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably the speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for
