CL AI SD AS | Aug 5, 2025

Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

Hojun Jin, Eunsoo Hong, Ziwon Hyung, Sungjun Lim, Seungjin Lee, Keunseok Cho

arXiv:2508.10009v1 | h-index: 5 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses performance degradation in multi-task learning for speech-to-text applications. The advance is incremental, since it builds on existing mixture-of-experts methods.

The paper tackles task interference in multi-task speech-to-text models by proposing Supervised Mixture of Experts (S-MoE), which uses guiding tokens to route tasks to separate experts, achieving a 6.35% relative improvement in Word Error Rate.

Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
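The routing idea in the abstract can be illustrated with a minimal sketch, assuming a fixed lookup from guiding token to expert in place of a learned gate. This is not the authors' implementation: the token names `<asr>` and `<st>`, the class `SupervisedMoE`, the layer sizes, and the random weights are all hypothetical, and plain Python lists stand in for tensors.

```python
import random

# Hypothetical guiding tokens; the paper's actual token names are not given here.
TASK_TO_EXPERT = {"<asr>": 0, "<st>": 1}


def make_ffn(d_model, d_hidden, rng):
    """Create one expert: a two-layer feedforward network with random weights."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w1, w2


def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def ffn_forward(x, expert):
    """Apply one expert: linear layer, ReLU, linear layer."""
    w1, w2 = expert
    hidden = [max(0.0, h) for h in matvec(x, w1)]
    return matvec(hidden, w2)


class SupervisedMoE:
    """Illustrative layer: one feedforward expert per task, no gating network."""

    def __init__(self, d_model, d_hidden, num_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [make_ffn(d_model, d_hidden, rng) for _ in range(num_experts)]

    def forward(self, frames, guiding_token):
        # The guiding token selects the expert directly, so there is
        # no gating function to train.
        expert = self.experts[TASK_TO_EXPERT[guiding_token]]
        return [ffn_forward(x, expert) for x in frames]


layer = SupervisedMoE(d_model=4, d_hidden=8, num_experts=2)
frames = [[1.0, 0.5, -0.5, 2.0], [0.1, 0.2, 0.3, 0.4]]
asr_out = layer.forward(frames, "<asr>")  # processed by expert 0
st_out = layer.forward(frames, "<st>")    # processed by expert 1
```

Because the expert is chosen by a dictionary lookup, every input for a given task passes through the same feedforward weights, which is what keeps the tasks from sharing (and interfering in) that part of the model in this sketch.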
