CV · Mar 5, 2021

Causal Attention for Vision-Language Tasks

Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai

arXiv:2103.03493v1 · 212 citations · Has Code

AI Analysis

This paper addresses bias and generalization issues in vision-language tasks. The method is novel but incremental, since it builds on existing attention mechanisms.

The authors tackle confounding effects that cause harmful bias in attention-based vision-language models. They introduce Causal Attention (CATT), which uses the front-door adjustment to perform causal intervention without requiring knowledge of the confounder. CATT improves performance across various models and lets lighter models such as LXMERT perform comparably to heavier ones such as UNITER.
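For reference, the standard front-door adjustment formula from causal inference is given below. The symbols are assumptions chosen for illustration and do not come from the text above: X is the input, Y is the prediction, and Z is a mediator through which X influences Y. The unobserved confounder does not appear on the right-hand side, which is why the adjustment needs no knowledge of it.

```latex
P(Y \mid do(X)) = \sum_{z} P(Z = z \mid X) \sum_{x'} P(X = x') \, P(Y \mid Z = z, X = x')
```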

We present a novel attention mechanism, Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module into focusing on spurious correlations in the training data, damaging model generalization. As the confounder is generally unobserved, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module, such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training: for example, it can make the lighter LXMERT (Tan and Bansal, 2019), which uses less data and less computational power, comparable to the heavier UNITER (Chen et al., 2020). Code is published at https://github.com/yangxuntu/catt.
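The two-branch structure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes scaled dot-product attention for both branches, a hypothetical "global dictionary" of keys and values (`k_global`, `v_global`) standing in for other samples, and concatenation as the way the two outputs are combined. All function and variable names are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention following the Q-K-V convention.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v) -> output (n_q, d_v)
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def causal_attention_sketch(q, k_in, v_in, k_global, v_global):
    """Hypothetical CATT-style layer: IS-ATT plus CS-ATT.

    IS-ATT attends over keys/values from the current sample.
    CS-ATT attends over keys/values summarising other samples
    (assumed here to be a fixed dictionary passed in by the caller).
    """
    is_out = attention(q, k_in, v_in)          # In-Sample Attention
    cs_out = attention(q, k_global, v_global)  # Cross-Sample Attention
    # Assumption: the two outputs are combined by concatenation.
    return np.concatenate([is_out, cs_out], axis=-1)

# Usage with random data: 4 queries, 5 in-sample items, 10 dictionary entries.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k_in, v_in = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
k_global, v_global = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
out = causal_attention_sketch(q, k_in, v_in, k_global, v_global)
print(out.shape)
```

Because both branches share the same query and follow the Q-K-V interface, a layer of this shape could stand in wherever a single attention call is used, at the cost of a wider output that a following projection would need to absorb.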
