CV | CL | Sep 9, 2021

TxT: Crossmodal End-to-End Learning with Transformers

arXiv:2109.04422v1 | 2 citations
AI Analysis

This work addresses the limitation of using pre-extracted visual features in multimodal pipelines, offering a more integrated approach for researchers in computer vision and natural language processing, though it is incremental in combining existing transformer-based methods.

The paper tackles the problem of multimodal reasoning in Visual Question Answering by proposing TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components end-to-end, achieving considerable gains in performance.

Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering.

