Whisper-UT: A Unified Translation Framework for Speech and Text
This work addresses multi-modal adaptation for researchers and practitioners in machine translation. It appears incremental, since it builds on existing Whisper models with adapter-based fine-tuning.
The paper tackles the challenge of efficiently adapting encoder-decoder models to diverse uni-modal and multi-modal scenarios. It proposes Whisper-UT, a unified framework that uses lightweight adapters to adapt across tasks, including multi-modal machine translation (MMT) that conditions on both speech and source-language text. It also improves speech translation (ST) performance through a two-stage decoding strategy.
Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni-modal and multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source-language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a two-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring three-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.
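The two-stage decoding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two hypothetical callables, `transcribe` and `translate`, standing in for an ASR pass and a translation pass of the same model, and it uses toy stand-ins so the control flow can be run without any model. The names `two_stage_translate`, `TwoStageResult`, `toy_transcribe`, and `toy_translate` are illustrative assumptions, not part of Whisper or any library.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Audio = Sequence[float]  # placeholder type for raw waveform samples


@dataclass
class TwoStageResult:
    transcript: str   # text used as the prompt in stage 2
    translation: str  # final translation


def two_stage_translate(
    audio: Audio,
    transcribe: Callable[[Audio], str],
    translate: Callable[[Audio, Optional[str]], str],
    reference_transcript: Optional[str] = None,
) -> TwoStageResult:
    """Hypothetical two-stage decoding loop.

    Stage 1 produces a source-language transcript (an ASR hypothesis),
    unless a ground-truth transcript is supplied. Stage 2 translates
    while conditioning on both the speech and that transcript.
    """
    if reference_transcript is not None:
        transcript = reference_transcript  # ground-truth prompt, skip ASR
    else:
        transcript = transcribe(audio)     # stage 1: ASR hypothesis
    translation = translate(audio, transcript)  # stage 2: speech + text
    return TwoStageResult(transcript=transcript, translation=translation)


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_transcribe(audio: Audio) -> str:
    return "hola mundo"


def toy_translate(audio: Audio, prompt: Optional[str]) -> str:
    lexicon = {"hola": "hello", "mundo": "world"}
    if prompt is None:
        return "<speech-only translation>"
    return " ".join(lexicon.get(word, word) for word in prompt.split())


result = two_stage_translate([0.0, 0.1, -0.1], toy_transcribe, toy_translate)
print(result.transcript)
print(result.translation)
```

In a real system, both callables would wrap the same adapted model, with the stage-1 transcript passed to the decoder as a prompt. How that prompt is formatted is not specified in the abstract, so the sketch leaves it to the `translate` callable.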