UniT: Multimodal Multitask Learning with a Unified Transformer
This work addresses the inefficiency of training and maintaining a separate model for each task: it lets a single model handle tasks as varied as object detection and natural language understanding. The contribution is incremental, in that it combines existing transformer architectures rather than introducing a new one.
The authors tackle the challenge of performing diverse tasks across different domains with a single model. They propose UniT, a Unified Transformer that jointly learns 7 tasks over 8 datasets and achieves strong performance on each task with significantly fewer parameters than separately fine-tuned task-specific models.
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning. Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder over the encoded input representations, followed by task-specific output heads. The entire model is jointly trained end-to-end with losses from each task. Compared to previous efforts on multi-task learning with transformers, we share the same model parameters across all tasks instead of separately fine-tuning task-specific models and handle a much higher variety of tasks across different domains. In our experiments, we learn 7 tasks jointly over 8 datasets, achieving strong performance on each task with significantly fewer parameters. Our code is available in MMF at https://mmf.sh.
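The data flow described in the abstract (one encoder per input modality, a single decoder shared by all tasks, and a task-specific output head) can be illustrated with a minimal structural sketch. This is not the authors' implementation. The names `UnifiedModel`, `encode_image`, `encode_text` and `shared_decoder`, the task keys, and the toy arithmetic inside each component are all hypothetical stand-ins for real transformer modules. They are chosen only so that the routing can be run with the Python standard library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Token = List[float]      # one feature vector
Encoded = List[Token]    # an encoded input: a list of feature vectors


def encode_image(rows: List[List[float]]) -> Encoded:
    # Hypothetical stand-in for a visual encoder: one 2-d token per pixel row.
    return [[sum(r) / len(r), max(r)] for r in rows]


def encode_text(words: List[str]) -> Encoded:
    # Hypothetical stand-in for a text encoder: one 2-d token per word.
    return [[float(len(w)), 1.0] for w in words]


def shared_decoder(tokens: Encoded) -> Token:
    # Hypothetical stand-in for the shared decoder: mean-pool the encoded tokens.
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(2)]


@dataclass
class UnifiedModel:
    encoders: Dict[str, Callable[..., Encoded]]   # one encoder per modality
    decoder: Callable[[Encoded], Token]           # one decoder shared by all tasks
    heads: Dict[str, Callable[[Token], object]]   # one output head per task

    def forward(self, task: str, inputs: Dict[str, object]) -> object:
        encoded: Encoded = []
        for modality, raw in inputs.items():
            # Encode each modality separately, then concatenate the sequences.
            encoded.extend(self.encoders[modality](raw))
        hidden = self.decoder(encoded)     # same decoder regardless of task
        return self.heads[task](hidden)    # task-specific prediction


model = UnifiedModel(
    encoders={"image": encode_image, "text": encode_text},
    decoder=shared_decoder,
    heads={  # toy heads; the task names are illustrative only
        "detection": lambda h: {"score": h[1]},
        "nlu": lambda h: "long" if h[0] > 4 else "short",
        "multimodal": lambda h: h[0] + h[1],
    },
)

det = model.forward("detection", {"image": [[0.0, 1.0], [0.5, 0.5]]})
nlu = model.forward("nlu", {"text": ["unified", "transformer"]})
mm = model.forward("multimodal", {"image": [[0.0, 1.0]], "text": ["what"]})
```

In this sketch, adding a task only adds an entry to `heads`, while the encoders and decoder are reused. That reuse is the parameter-sharing idea the abstract contrasts with separately fine-tuned models. Joint end-to-end training, in which the loss from every task updates the shared components, is not modeled here.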