Auto-Parsing Network for Image Captioning and Visual Question Answering
This work addresses the challenge of incorporating hierarchical knowledge into Transformer-based vision-language models. Its contribution is a new structural constraint added to existing Transformer methods, so it can be read as an incremental extension rather than a new architecture.
The authors propose an Auto-Parsing Network (APN) that discovers hidden tree structures in the input data in order to improve Transformer-based vision-language systems. They report enhanced performance on image captioning and visual question answering, though no concrete numbers are provided here.
We propose an Auto-Parsing Network (APN) to discover and exploit the hidden tree structures of the input data, improving the effectiveness of Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM), parameterized by the attention operations, on each self-attention layer to incorporate a sparsity assumption. We use this PGM to softly segment an input sequence into a few clusters, where each cluster can be treated as the parent of the entities inside it. By stacking these PGM-constrained self-attention layers, the clusters in a lower layer compose a new sequence, and the PGM in a higher layer further segments this sequence. Iteratively, a sparse tree is implicitly parsed, and its hierarchical knowledge is incorporated into the transformed embeddings, which can be used to solve the target vision-language tasks. We show that APN can strengthen Transformer-based networks in two major vision-language tasks: Captioning and Visual Question Answering. We also develop a PGM probability-based parsing algorithm that reveals the hidden structure of the input during inference.
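To make the idea of softly segmenting a sequence inside a self-attention layer more concrete, the following is a minimal NumPy sketch. It is not the authors' PGM: the abstract does not specify its form, so the sketch assumes a simple stand-in in which each pair of adjacent positions gets a "link" probability, and the chance that two positions share a cluster is the product of the links between them. All function names (`soft_cluster_mask`, `constrained_self_attention`, `parse_spans`) and the greedy weakest-link parsing rule are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_cluster_mask(link_prob):
    """Assumed stand-in for the PGM: link_prob[k] is the probability that
    positions k and k+1 belong to the same cluster. Returns an (N, N) matrix
    whose entry (i, j) is the product of the links between i and j."""
    log_p = np.log(np.clip(link_prob, 1e-9, 1.0))
    cum = np.concatenate([[0.0], np.cumsum(log_p)])  # cum[i] = sum of log_p[:i]
    return np.exp(-np.abs(cum[:, None] - cum[None, :]))

def constrained_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention whose weights are rescaled by the soft
    cluster mask, so attention concentrates inside each soft segment."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Assumption: adjacent links are scored from neighbouring query/key pairs.
    adj = np.einsum("id,id->i", q[:-1], k[1:]) / np.sqrt(d)
    link_prob = 1.0 / (1.0 + np.exp(-adj))
    attn = softmax(scores) * soft_cluster_mask(link_prob)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalise each row
    return attn @ v, link_prob

def parse_spans(link_prob, lo=0, hi=None):
    """Hypothetical greedy parse: recursively split the span of positions
    [lo, hi] at its weakest adjacent link, yielding a nested-tuple tree."""
    if hi is None:
        hi = len(link_prob)
    if lo == hi:
        return lo
    split = lo + int(np.argmin(link_prob[lo:hi]))  # link between split, split+1
    return (parse_spans(link_prob, lo, split),
            parse_spans(link_prob, split + 1, hi))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # 5 tokens or regions, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, links = constrained_self_attention(x, w_q, w_k, w_v)
tree = parse_spans(links)
```

Stacking several such layers, with link probabilities that tend to grow in higher layers, would merge small clusters into larger ones, which is one way to picture the implicit tree described above. The parse is read off the link probabilities alone, so it adds no learned parameters.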