LG CL CV

Sep 27, 2023

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi

arXiv:2309.16058v1

36.1124 citations

h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses the integration of multiple input modalities into a single language model, for applications that must process diverse data types. The contribution appears incremental, since it builds on existing LLMs and aligner modules.

The authors build a unified language model that reasons over diverse input modalities (text, image, video, audio, IMU motion sensor) and generates textual responses. They report state-of-the-art performance on various multimodal tasks, supported by both human and automatic evaluations.

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
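
The idea of converting modality-specific signals into the LLM's textual space can be illustrated with a small sketch. The abstract does not describe the aligner's architecture, so the single linear projection below is an assumption made only for illustration, as are all names and dimensions (`make_linear_aligner`, `build_llm_input`, `ENC_DIM`, `LLM_DIM`). The weights are random and untrained; the point is the data flow: encoder features are projected to the text embedding size and prepended to the text token embeddings.

```python
import random


def make_linear_aligner(in_dim, out_dim, seed=0):
    """Return a random (untrained) in_dim x out_dim projection matrix.

    Assumption: a plain linear map stands in for the pre-trained aligner;
    the abstract does not specify the actual module.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weight):
    """Map each modality feature vector (length in_dim) to length out_dim."""
    out_dim = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(out_dim)]
        for f in features
    ]


def build_llm_input(modality_features, text_embeddings, weight):
    """Prepend projected modality vectors to the text token embeddings."""
    return project(modality_features, weight) + text_embeddings


# Hypothetical toy sizes; real encoders and LLMs use far larger dimensions.
ENC_DIM, LLM_DIM = 8, 4
weight = make_linear_aligner(ENC_DIM, LLM_DIM)
image_features = [[1.0] * ENC_DIM for _ in range(3)]   # 3 vectors from a modality encoder
text_embeddings = [[0.0] * LLM_DIM for _ in range(5)]  # 5 text token embeddings
sequence = build_llm_input(image_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # 8 4
```

In this sketch the language model would consume `sequence` in place of its usual token embeddings, so image, audio, or IMU inputs differ only in which encoder and projection produce the leading vectors.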
