CL, AI, SD, AS | Aug 5, 2025

Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

arXiv:2508.10009v1 | h-index: 5 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This paper addresses performance degradation in multi-task learning for speech-to-text applications. The advance is incremental, since it builds on existing mixture-of-experts methods.

The paper tackles task interference in multi-task speech-to-text models by proposing Supervised Mixture of Experts (S-MoE), which uses guiding tokens to route tasks to separate experts, achieving a 6.35% relative improvement in Word Error Rate.

Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
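
The routing idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the guiding-token strings `<asr>` and `<st>`, the class name `SupervisedMoE`, the ReLU activation, the omission of biases, and the layer sizes are all hypothetical choices made for the example. The point it shows is that the expert is chosen by a lookup on the guiding token, so there is no gating function to train.

```python
import random


def make_expert(d_model, d_hidden, rng):
    """Create one feed-forward expert as two weight matrices (nested lists)."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w1, w2


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


class SupervisedMoE:
    """Hypothetical feed-forward block with one expert per task."""

    def __init__(self, d_model, d_hidden, guiding_tokens, seed=0):
        rng = random.Random(seed)
        # One separate feed-forward network per guiding token (assumed names).
        self.experts = {tok: make_expert(d_model, d_hidden, rng) for tok in guiding_tokens}

    def forward(self, frames, guiding_token):
        # Routing is a plain lookup on the guiding token: no learned gate.
        if guiding_token not in self.experts:
            raise KeyError(f"no expert registered for {guiding_token!r}")
        w1, w2 = self.experts[guiding_token]
        out = []
        for frame in frames:
            hidden = [max(h, 0.0) for h in matvec(frame, w1)]  # ReLU (assumed)
            out.append(matvec(hidden, w2))
        return out


layer = SupervisedMoE(d_model=4, d_hidden=8, guiding_tokens=["<asr>", "<st>"])
frames = [[1.0, 0.5, -0.5, 2.0], [0.0, 1.0, 1.0, 0.0]]
asr_out = layer.forward(frames, "<asr>")  # same input, ASR expert
st_out = layer.forward(frames, "<st>")    # same input, ST expert
```

Because each task owns its feed-forward weights, updates for one task would not touch the other task's expert, which is how this kind of design sidesteps the interference associated with hard-parameter sharing. Everything outside the expert block (for example, attention layers) could still be shared.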
