LG | CV | MM | Feb 23, 2024

Multimodal Transformer With a Low-Computational-Cost Guarantee

arXiv:2402.15096v1 | 2 citations | h-index: 2 | ICASSP
Originality: Incremental advance
AI Analysis

The work addresses efficiency issues for researchers and practitioners using multimodal AI, though it is incremental because it builds on existing Transformer frameworks.

The paper tackles the high computational cost of multimodal Transformers by introducing LoCoMT, a novel attention mechanism that reduces GFLOPs while matching or outperforming established models on datasets like Audioset and MedVidCL.

Transformer-based models have significantly improved performance across a range of multimodal understanding tasks, such as visual question answering and action recognition. However, multimodal Transformers suffer from the quadratic complexity of multi-head attention in the input sequence length, a cost that grows as the number of modalities increases. To address this, we introduce Low-Cost Multimodal Transformer (LoCoMT), a novel multimodal attention mechanism that aims to reduce computational cost during training and inference with minimal performance loss. Specifically, by assigning different multimodal attention patterns to each attention head, LoCoMT can flexibly control multimodal signals and theoretically ensures a reduced computational cost compared to existing multimodal Transformer variants. Experimental results on two multimodal datasets, namely Audioset and MedVidCL, demonstrate that LoCoMT not only reduces GFLOPs but also matches or even outperforms established models.
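To make the cost argument in the abstract concrete, the following is a minimal sketch that counts query-key score pairs per attention head for two modalities. The three pattern names (`joint`, `self`, `cross`), the helper functions, and the token counts are assumptions chosen for illustration; the abstract does not specify which patterns LoCoMT uses or how they are assigned, so this should not be read as the authors' implementation.

```python
# Hypothetical illustration (not the paper's code): count how many
# query-key pairs each attention head scores under an assumed pattern.

def pairs_per_head(pattern, n_a, n_v):
    """Query-key pairs scored by one head for modalities of length n_a and n_v."""
    if pattern == "joint":   # every token attends to all tokens of both modalities
        return (n_a + n_v) ** 2
    if pattern == "self":    # tokens attend only within their own modality
        return n_a ** 2 + n_v ** 2
    if pattern == "cross":   # tokens attend only to the other modality
        return 2 * n_a * n_v
    raise ValueError(f"unknown pattern: {pattern}")


def layer_pairs(head_patterns, n_a, n_v):
    """Total pairs for one layer, given one assumed pattern per head."""
    return sum(pairs_per_head(p, n_a, n_v) for p in head_patterns)


# Assumed sequence lengths, for illustration only.
n_audio, n_video = 100, 200

joint = layer_pairs(["joint"] * 4, n_audio, n_video)
mixed = layer_pairs(["self", "self", "cross", "cross"], n_audio, n_video)

print(joint, mixed, mixed / joint)
```

Because (n_a + n_v)^2 = n_a^2 + n_v^2 + 2 n_a n_v, a head restricted to either the `self` or the `cross` pattern in this sketch always scores fewer pairs than a `joint` head, which is one simple way a per-head pattern assignment could bound attention cost below that of full joint attention.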
