Transforming Visual Scene Graphs to Image Captions
Xu Yang, Jiawei Peng, Zihua Wang et al.
We propose to Transform Scene Graphs (TSG) into more descriptive captions. In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) that embeds scene graphs. After embedding, different graph embeddings contain distinct knowledge for generating words with different parts of speech, e.g., object/attribute embeddings are good for generating nouns/adjectives. Motivated by this, we design a Mixture-of-Expert (MOE)-based decoder, where each expert is built on MHA, for discriminating the graph embeddings to generate different kinds of words. Since both the encoder and decoder are built on MHA, we obtain a homogeneous encoder-decoder, unlike previous heterogeneous ones, which usually apply a Fully-Connected-based GNN and an LSTM-based decoder. The homogeneous architecture enables us to unify the training configuration of the whole model instead of specifying different training strategies for different sub-networks as in the heterogeneous pipeline, which eases training. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of our TSG. The code is available at https://github.com/GaryJiajia/TSG.
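The idea of routing between experts for different kinds of words can be sketched as a soft mixture. The snippet below is only an illustration under stated assumptions: the expert names, the two-dimensional outputs, and the softmax gate are hypothetical, and the abstract does not say whether TSG mixes experts softly or in some other way.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_outputs, gate_scores):
    """Return the gate-weighted sum of expert output vectors and the gate weights."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return mixed, weights

# Hypothetical outputs of three experts (object, attribute, relation)
# for the current decoding step; the values are invented.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical gate scores: the decoder state favours the object expert,
# as it might when the next word is a noun.
gate_scores = [2.0, 0.0, 0.0]

mixed, weights = mix_experts(expert_outputs, gate_scores)
```

With these made-up scores the object expert receives the largest weight, so the mixed vector is dominated by its output.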
Towards Unbiased Visual Emotion Recognition via Causal Intervention
Yuedong Chen, Xu Yang, Tat-Jen Cham et al.
Although much progress has been made in visual emotion recognition, researchers have realized that modern deep networks tend to exploit dataset characteristics to learn spurious statistical associations between the input and the target. Such dataset characteristics are usually treated as dataset bias, which damages the robustness and generalization performance of these recognition systems. In this work, we scrutinize this problem from the perspective of causal inference, where such a dataset characteristic is termed a confounder, which misleads the system into learning the spurious correlation. To alleviate the negative effects of dataset bias, we propose a novel Interventional Emotion Recognition Network (IERN) to achieve the backdoor adjustment, a fundamental deconfounding technique in causal inference. Specifically, IERN starts by disentangling the dataset-related context feature from the actual emotion feature, where the former forms the confounder. The emotion feature is then forced to see each confounder stratum equally before being fed into the classifier. A series of designed tests validate the efficacy of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms state-of-the-art approaches for unbiased visual emotion recognition. Code is available at https://github.com/donydchen/causal_emotion
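As a worked illustration of the backdoor adjustment, the sketch below computes P(y | do(x)) = sum_z P(y | x, z) P(z) on a toy discrete problem and compares it with the ordinary conditional P(y | x). The probability tables and stratum names are invented, and this is not IERN's implementation; it only shows why weighting each confounder stratum by its prior P(z), rather than by P(z | x), changes the answer.

```python
# Toy backdoor adjustment. All numbers below are invented for illustration.
# z: confounder stratum (e.g. dataset context), x: input, y: emotion label.

p_z = {"indoor": 0.5, "outdoor": 0.5}            # P(z)
p_z_given_x = {"indoor": 0.9, "outdoor": 0.1}    # P(z | x): x mostly co-occurs with "indoor"
p_y_given_xz = {"indoor": 0.8, "outdoor": 0.2}   # P(y = happy | x, z)

def observational(p_y_given_xz, p_z_given_x):
    """P(y | x) = sum_z P(y | x, z) P(z | x); inherits the bias in P(z | x)."""
    return sum(p_y_given_xz[z] * p_z_given_x[z] for z in p_y_given_xz)

def backdoor(p_y_given_xz, p_z):
    """P(y | do(x)) = sum_z P(y | x, z) P(z); each stratum counts by its prior."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_y_given_xz)

p_obs = observational(p_y_given_xz, p_z_given_x)
p_do = backdoor(p_y_given_xz, p_z)
```

In this toy setting the observational estimate is inflated by the over-represented "indoor" stratum, while the adjusted estimate treats both strata according to their priors.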
Causal Attention for Vision-Language Tasks
Xu Yang, Hanwang Zhang, Guojun Qi et al.
We present a novel attention mechanism, Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module into focusing on spurious correlations in the training data, damaging model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module, such as top-down attention and self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training, e.g., it can bring the lighter LXMERT~\cite{tan2019lxmert}, which uses less data and less computational power, to a level comparable with the heavier UNITER~\cite{chen2020uniter}. Code is published at https://github.com/yangxuntu/catt.
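The combination of IS-ATT and CS-ATT can be sketched with plain Q-K-V attention. The snippet is a minimal illustration under assumptions not stated in the abstract: `global_dict` is a hypothetical set of vectors standing in for "other samples" (how it is built is left open), and the two outputs are simply concatenated.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def causal_attention(query, sample_feats, global_dict):
    """Concatenate in-sample and cross-sample attention outputs (assumed fusion)."""
    is_out = attend(query, sample_feats, sample_feats)  # IS-ATT: current sample only
    cs_out = attend(query, global_dict, global_dict)    # CS-ATT: vectors from other samples
    return is_out + cs_out                              # list concatenation

# Invented toy inputs.
sample_feats = [[1.0, 0.0], [0.0, 1.0]]
global_dict = [[1.0, 1.0], [-1.0, -1.0], [0.0, 2.0]]
out = causal_attention([1.0, 0.0], sample_feats, global_dict)
```

Because both branches take a query, keys, and values, a module shaped like `causal_attention` could stand in wherever an ordinary attention call is used, which is the property the abstract highlights.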
Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features
Xu Yang, Hanwang Zhang, Jianfei Cai
Because it is prohibitively expensive to completely annotate visual relationships, i.e., the (obj1, rel, obj2) triplets, relationship models are inevitably biased toward object classes with limited pairwise patterns, leading to poor generalization to rare or unseen object combinations. Therefore, we are interested in learning object-agnostic visual features for more generalizable relationship models. By "agnostic", we mean that the feature is less likely to be biased toward the classes of the paired objects. To alleviate the bias, we propose a novel Shuffle-Then-Assemble pre-training strategy. First, we discard all the triplet relationship annotations in an image, leaving two unpaired object domains without obj1-obj2 alignment. Then, our feature learning task is to recover possible obj1-obj2 pairs. In particular, we design a cycle of residual transformations between the two domains to capture shared but not object-specific visual patterns. Extensive experiments on two visual relationship benchmarks show that, with our pre-trained features, naive relationship models are consistently improved and even outperform other state-of-the-art relationship models. Code is available at https://github.com/yangxuntu/vrd.
12.1 CV Aug 24, 2021
Auto-Parsing Network for Image Captioning and Visual Question Answering
Xu Yang, Chongyang Gao, Hanwang Zhang et al.
We propose an Auto-Parsing Network (APN) to discover and exploit the hidden tree structures of the input data, improving the effectiveness of Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM), parameterized by the attention operations, on each self-attention layer to incorporate a sparsity assumption. We use this PGM to softly segment an input sequence into a few clusters, where each cluster can be treated as the parent of the entities inside it. By stacking these PGM-constrained self-attention layers, the clusters in a lower layer compose a new sequence, and the PGM in a higher layer further segments this sequence. Iteratively, a sparse tree is implicitly parsed, and this tree's hierarchical knowledge is incorporated into the transformed embeddings, which can be used for solving the target vision-language tasks. In particular, we show that our APN can strengthen Transformer-based networks in two major vision-language tasks: Captioning and Visual Question Answering. We also develop a PGM probability-based parsing algorithm with which we can discover the hidden structure of the input during inference.
12.4 CV Sep 30, 2020
Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning
Xiangxi Shi, Xu Yang, Jiuxiang Gu et al.
Change Captioning is a task that aims to describe the difference between images in natural language. Most existing methods treat this problem as a difference judgment and assume no distractors, such as viewpoint changes. However, in practice, viewpoint changes happen often and can overwhelm the semantic difference to be described. In this paper, we propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task. Moreover, we simulate the attention preference of humans and propose a novel reinforcement learning process to fine-tune the attention directly with language evaluation rewards. Extensive experimental results show that our method outperforms the state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
25.8 CV Mar 9, 2020
Deconfounded Image Captioning: A Causal Retrospect
Xu Yang, Hanwang Zhang, Jianfei Cai
Dataset bias in vision-language tasks is becoming one of the main problems hindering the progress of our community. Existing solutions lack a principled analysis of why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective, Deconfounded Image Captioning (DIC), to answer this question, then take a retrospective look at modern neural image captioners, and finally propose a DIC framework, DICv1.0, to alleviate the negative effects of dataset bias. DIC is based on causal inference, whose two principles, the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we show that DICv1.0 can strengthen two prevailing captioning models and can achieve a single-model 131.1 CIDEr-D and 128.4 c40 CIDEr-D on the Karpathy split and the online split of the challenging MS COCO dataset, respectively. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.
17.1 CV Apr 18, 2019
Learning to Collocate Neural Modules for Image Captioning
Xu Yang, Hanwang Zhang, Jianfei Cai
We do not speak word by word from scratch; our brain quickly structures a pattern like "STH DO STH AT SOMEPLACE" and then fills in the detailed descriptions. To endow existing encoder-decoder image captioners with such human-like reasoning, we propose a novel framework, learning to Collocate Neural Modules (CNM), to generate the 'inner pattern' connecting the visual encoder and the language decoder. Unlike the widely used neural module networks in visual Q&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging because the language is being generated and thus is only partially observable. To this end, we make the following technical contributions for CNM training: 1) a compact module design: one module for function words and three for visual content words (e.g., noun, adjective, and verb); 2) soft module fusion and multi-step module execution, which make the visual reasoning robust under partial observation; 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective comes before a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to few training samples, e.g., when trained with only one sentence per image, CNM can halve the performance loss compared to a strong baseline.
25.2 CV Mar 26, 2019
Unpaired Image Captioning via Scene Graph Alignments
Jiuxiang Gu, Shafiq Joty, Jianfei Cai et al.
Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale paired image-caption data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image modality to the sentence modality. Experimental results show that our proposed model can generate promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.
Auto-Encoding Scene Graphs for Image Captioning
Xu Yang, Kaihua Tang, Hanwang Zhang et al.
We propose the Scene Graph Auto-Encoder (SGAE), which incorporates the language inductive bias into the encoder-decoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and perform contextual inference in discourse. For example, when we see the relation 'person on bike', it is natural to replace 'on' with 'ride' and infer 'person riding bike on a road' even if the 'road' is not evident. Therefore, exploiting such bias as a language prior is expected to make conventional encoder-decoder models less likely to overfit to dataset bias and more focused on reasoning. Specifically, we use the scene graph, a directed graph ($\mathcal{G}$) where an object node is connected to adjective nodes and relationship nodes, to represent the complex structural layout of both the image ($\mathcal{I}$) and the sentence ($\mathcal{S}$). In the textual domain, we use SGAE to learn a dictionary ($\mathcal{D}$) that helps to reconstruct sentences in the $\mathcal{S}\rightarrow \mathcal{G} \rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline, where $\mathcal{D}$ encodes the desired language prior; in the vision-language domain, we use the shared $\mathcal{D}$ to guide the encoder-decoder in the $\mathcal{I}\rightarrow \mathcal{G}\rightarrow \mathcal{D} \rightarrow \mathcal{S}$ pipeline. Thanks to the scene graph representation and the shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single model achieves a new state-of-the-art 127.8 CIDEr-D on the Karpathy split, and a competitive 125.5 CIDEr-D (c40) on the official server, even compared to other ensemble models.
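One way to read the $\mathcal{G} \rightarrow \mathcal{D}$ step is as an attention-style re-encoding of a graph embedding through the dictionary. The form below is an assumption made for illustration, since the abstract does not specify it: $x$ is a hypothetical embedding of a scene graph node, and the dictionary $\mathcal{D}$ is treated as a matrix whose $K$ columns $d_1, \dots, d_K$ are learned atoms.

```latex
% Assumed re-encoding of a node embedding x through the dictionary D.
% x \in \mathbb{R}^{d}, \quad \mathcal{D} = [d_1, \dots, d_K] \in \mathbb{R}^{d \times K}
\alpha = \operatorname{softmax}\!\left(\mathcal{D}^{\top} x\right), \qquad
\hat{x} = \mathcal{D}\,\alpha = \sum_{k=1}^{K} \alpha_k \, d_k
```

Under this reading, $\hat{x}$ is a convex combination of dictionary atoms, so whatever prior is stored in $\mathcal{D}$ from the textual domain is injected into the representation regardless of whether $x$ came from a sentence or an image scene graph.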