Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning
This work addresses the need for more detailed image understanding in computer vision. It is incremental in that it builds on existing captioning methods; its novelty lies in the focus on relationships between objects.
The paper tackles the problem of generating dense, informative image captions by introducing relational captioning, a task that produces multiple captions based on the relationships between objects in an image. It proposes a multi-task triple-stream network (MTTSNet) that uses part-of-speech tags as a prior, and reports more diverse and richer representations than the baselines.
Our goal in this work is to train an image captioning model that generates more dense and informative captions. We introduce "relational captioning," a novel image captioning task that aims to generate multiple captions with respect to relational information between objects in an image. Relational captioning is a framework that is advantageous in both diversity and amount of information, leading to image understanding based on relationships. Part-of-speech (POS, i.e., subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet), which consists of three recurrent units for the respective POS and jointly performs POS prediction and captioning. We demonstrate more diverse and richer representations generated by the proposed model against several baselines and competing methods.
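The two ideas in the abstract, captioning relationships between objects and training POS prediction jointly with captioning, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that captions are produced for ordered (subject, object) pairs of detected regions, and that the joint objective is a word-level cross-entropy plus a weighted POS-level cross-entropy. The names `region_pairs`, `joint_loss`, and `pos_weight` are hypothetical and do not come from the paper.

```python
import math
from itertools import permutations


def region_pairs(regions):
    """Return all ordered (subject, object) pairs of distinct regions.

    Assumption: one relational caption is generated per ordered pair.
    """
    return list(permutations(regions, 2))


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under one distribution."""
    return -math.log(probs[target])


def joint_loss(word_probs, word_targets, pos_probs, pos_targets, pos_weight=1.0):
    """Mean word loss plus weighted mean POS loss for a single caption.

    word_probs[t] is the predicted distribution over the vocabulary at step t;
    pos_probs[t] is the predicted distribution over the three POS classes
    (subject, predicate, object) at step t. pos_weight is a hypothetical
    balancing factor, not a value taken from the paper.
    """
    steps = len(word_targets)
    caption_term = sum(
        cross_entropy(p, t) for p, t in zip(word_probs, word_targets)
    ) / steps
    pos_term = sum(
        cross_entropy(p, t) for p, t in zip(pos_probs, pos_targets)
    ) / steps
    return caption_term + pos_weight * pos_term
```

Under these assumptions, three detected regions yield six ordered pairs, and each pair would receive its own caption. The POS term acts as the prior described above: the model is penalized when it places a word in the wrong subject, predicate, or object slot, in addition to being penalized for predicting the wrong word.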