CV MM

Aug 6, 2024

One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning

Hao Sun, Yu Song, Jiaqing Liu, Jihong Hu, Yen-Wei Chen, Lanfen Lin

arXiv:2408.03001v32.02 citations

h-index: 44

Originality: Incremental advance

AI Analysis

This work targets the limited scalability and versatility of task-specific tuning in multimodal AI. It appears incremental, since it builds on existing tuning methods.

The authors tackled the challenge of enabling large models to process multimodal tasks efficiently by proposing a unified framework that handles multiple tasks and modalities with a single tuning strategy, and they demonstrated its effectiveness on a new benchmark.

Large-scale models have exhibited remarkable capabilities across diverse domains, including automated medical services and intelligent customer support. However, as most large models are trained on single-modality corpora, enabling them to effectively process and understand multimodal signals remains a significant challenge. Current research often focuses on designing task-specific or scenario-specific tuning strategies, which limits the scalability and versatility. To address this limitation, we propose a unified framework that concurrently handles multiple tasks and modalities. In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach. To enable efficient multitask processing, we introduce a novel tuning strategy termed neural tuning, inspired by the concept of sparse distributed representation in the human brain, where only specific subsets of neurons are activated for each task. Furthermore, to advance research in multimodal and multitask learning, we present a new benchmark, MMUD, which includes samples annotated with multiple task labels spanning reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. By applying neural tuning to pretrained large models on the MMUD benchmark, we demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner. All models, code, and datasets will be released publicly upon publication, fostering further research and innovation in this field.
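The abstract describes neural tuning only at a high level: for each task, only a specific subset of neurons is activated. The following is a minimal sketch of that general idea, not the authors' implementation. The random mask assignment, the layer sizes, and the names `make_task_masks` and `sparse_forward` are all assumptions made for illustration; the paper may select and train task-specific neurons quite differently.

```python
import random

# Task names taken from the MMUD description in the abstract.
TASKS = [
    "reasoning_segmentation",
    "referring_segmentation",
    "image_captioning",
    "text_to_image",
]


def make_task_masks(num_neurons, tasks, active_fraction, seed=0):
    """Assign each task a fixed subset of neuron indices.

    Hypothetical scheme: subsets are drawn at random. The source does not
    say how neurons are chosen per task.
    """
    rng = random.Random(seed)
    k = max(1, int(num_neurons * active_fraction))
    return {task: sorted(rng.sample(range(num_neurons), k)) for task in tasks}


def sparse_forward(x, weights, bias, active):
    """ReLU layer that computes only the neurons listed in `active`.

    Inactive neurons are skipped entirely and output 0.0.
    """
    out = [0.0] * len(weights)
    for j in active:
        pre = sum(w * xi for w, xi in zip(weights[j], x)) + bias[j]
        out[j] = max(0.0, pre)
    return out


# Toy dimensions, chosen only for the example.
NUM_NEURONS, INPUT_DIM = 16, 4
rng = random.Random(1)
weights = [[rng.uniform(-1, 1) for _ in range(INPUT_DIM)] for _ in range(NUM_NEURONS)]
bias = [0.0] * NUM_NEURONS

masks = make_task_masks(NUM_NEURONS, TASKS, active_fraction=0.25)
x = [0.5, -0.2, 0.1, 0.9]

# The same input and shared weights yield a different sparse activation per task.
outputs = {task: sparse_forward(x, weights, bias, masks[task]) for task in TASKS}
```

In this sketch, all tasks share one weight matrix, and each task touches only a quarter of the neurons, which is the sense in which the computation is sparse.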
