AS · CL · SD · Jul 13, 2024

Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee

arXiv:2407.09886v2 · 14.927 citations · h-index: 12 · Has Code

Originality: Incremental advance

AI Analysis

Speech-Copilot offers a flexible solution for speech-processing applications and reduces the human effort needed for toolset construction, though the advance appears incremental because it builds on existing large language model capabilities.

The paper tackles the problem of instruction-oriented speech processing by introducing Speech-Copilot, a modular framework that decomposes tasks into sub-tasks and uses large language models for program generation, achieving state-of-the-art performance on the Dynamic-SUPERB benchmark.

In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
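The abstract's pipeline (a toolset of sub-task modules, plus an LLM agent that solves an instruction by generating a program over those modules) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and is not taken from the paper's code: the module names `speech_recognition` and `emotion_recognition`, the prompt layout, and `fake_llm`, which stands in for a real LLM call and returns a canned program.

```python
from typing import Callable, Dict

# Hypothetical toolset: one module per sub-task.
# Real modules would wrap speech models; these stubs return fixed values.
TOOLSET: Dict[str, Callable[[str], str]] = {}


def register(fn: Callable[[str], str]) -> Callable[[str], str]:
    """Add a module to the toolset under its function name."""
    TOOLSET[fn.__name__] = fn
    return fn


@register
def speech_recognition(audio_path: str) -> str:
    return "turn on the lights"  # stub transcript


@register
def emotion_recognition(audio_path: str) -> str:
    return "neutral"  # stub label


def build_prompt(instruction: str) -> str:
    """Describe the available modules and the task to the LLM."""
    signatures = "\n".join(f"- {name}(audio_path)" for name in sorted(TOOLSET))
    return (
        f"Available modules:\n{signatures}\n"
        f"Instruction: {instruction}\n"
        "Write a Python function solve(audio_path) using these modules."
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): returns a canned program."""
    return (
        "def solve(audio_path):\n"
        "    text = speech_recognition(audio_path)\n"
        "    return 'yes' if 'lights' in text else 'no'\n"
    )


def run_task(instruction: str, audio_path: str) -> str:
    """Generate a program for the instruction and run it over the toolset."""
    program = fake_llm(build_prompt(instruction))
    namespace = dict(TOOLSET)  # generated code sees only the toolset modules
    exec(program, namespace)   # defines solve() inside the namespace
    return namespace["solve"](audio_path)


answer = run_task("Does the speaker mention lights? Answer yes or no.", "sample.wav")
```

In this sketch, adding a capability means registering one more module; the agent itself is not retrained, which mirrors the abstract's claim of extensibility without additional training. Running generated code with `exec` is only acceptable in a toy setting; a real system would need sandboxing.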
