TokenVerse++: Towards Flexible Multitask Learning with Dynamic Task Activation
This work addresses a practical problem for speech processing researchers and practitioners: it makes multitask learning more flexible and scalable when only partial annotations are available. The contribution is, however, incremental over TokenVerse.
The paper tackles a limitation of token-based multitask learning frameworks, which require every training utterance to be labeled for all tasks. It proposes TokenVerse++, which uses learnable vectors for dynamic task activation so that the model can be trained on partially annotated datasets. TokenVerse++ performs on par with or better than the baseline TokenVerse across multiple tasks, including ASR and language identification, without sacrificing ASR performance.
Token-based multitasking frameworks like TokenVerse require all training utterances to have labels for all tasks, which prevents them from leveraging partially annotated datasets and scaling effectively. We propose TokenVerse++, which introduces learnable vectors in the acoustic embedding space of the XLSR-Transducer ASR model for dynamic task activation. This core mechanism enables training with utterances labeled for only a subset of tasks, a key advantage over TokenVerse. We demonstrate this by integrating a dataset labeled only for ASR and one additional task, language identification, which improves overall performance. TokenVerse++ achieves results on par with or exceeding TokenVerse across multiple tasks, establishing it as a more practical multitask alternative without sacrificing ASR performance.
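
The dynamic task activation idea can be illustrated with a minimal sketch. The text above says only that learnable vectors live in the acoustic embedding space, so everything else here is an assumption: activation is modelled as element-wise addition of a per-task vector to each acoustic frame, the vectors are fixed random values standing in for trained parameters, and the names `make_task_vectors`, `activate_tasks`, `labeled_tasks`, and the task label `OTHER` are hypothetical. It is not the authors' implementation.

```python
import random


def make_task_vectors(tasks, dim, seed=0):
    """Create one small random vector per task.

    Stand-in for learnable parameters; a real model would train these.
    """
    rng = random.Random(seed)
    return {task: [rng.gauss(0.0, 0.02) for _ in range(dim)] for task in tasks}


def activate_tasks(frames, task_vectors, active_tasks):
    """Return frames shifted by the sum of the active tasks' vectors.

    ASSUMPTION: activation is modelled as element-wise addition. The
    source does not specify how the vectors are combined with the frames.
    """
    if not frames:
        return []
    dim = len(frames[0])
    offset = [0.0] * dim
    for task in active_tasks:
        offset = [o + v for o, v in zip(offset, task_vectors[task])]
    return [[x + o for x, o in zip(frame, offset)] for frame in frames]


# Hypothetical utterances: each lists only the extra tasks it is annotated for.
# ASR is assumed to be always on, so it needs no activation vector here.
utterances = [
    {"id": "utt1", "frames": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], "labeled_tasks": ["LID"]},
    {"id": "utt2", "frames": [[0.7, 0.8, 0.9]], "labeled_tasks": []},
]

task_vectors = make_task_vectors(["LID", "OTHER"], dim=3)

# Each utterance is conditioned only on the tasks it actually has labels for.
conditioned = {
    utt["id"]: activate_tasks(utt["frames"], task_vectors, utt["labeled_tasks"])
    for utt in utterances
}
```

In this sketch an utterance without extra labels passes through unchanged, while an utterance labeled for language identification has the `LID` vector added to every frame. This is how a partially annotated dataset could be mixed into training without requiring labels for every task.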