Generating Descriptions for Sequential Images with Local-Object Attention and Global Semantic Context Modelling
This work addresses the problem of generating coherent descriptions for sequential images, which is relevant for applications like video captioning or story generation, but the specific gains are not quantified.
This paper proposes a CNN-LSTM model to generate descriptions for sequential images. The model uses a local-object attention mechanism and a multi-layer perceptron to capture global semantic context, outperforming baselines on Microsoft datasets across three evaluation metrics.
In this paper, we propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism. To generate coherent descriptions, we capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images. A paralleled LSTM network is exploited for decoding the sequence descriptions. Experimental results show that our model outperforms the baseline across three different evaluation metrics on the datasets published by Microsoft.