Incorporating Textual Evidence in Visual Storytelling
This work addresses visual storytelling, a task relevant to AI applications, but its contribution is incremental: it builds on existing methods by adding textual evidence.
The paper tackles the problem of generating coherent stories from image sequences by incorporating textual evidence from similar images, and reports outperforming state-of-the-art baseline models on the VIST dataset without heavy engineering.
Previous work on visual storytelling mainly focused on exploring image sequences as evidence for storytelling and neglected textual evidence for guiding story generation. Motivated by the human storytelling process, in which people recall stories for familiar images, we exploit textual evidence from similar images to help generate coherent and meaningful stories. To pick the images that may provide textual experience, we propose a two-step ranking method based on image object recognition techniques. To utilize the textual information, we design an extended Seq2Seq model with a two-channel encoder and attention. Experiments on the VIST dataset show that our method outperforms state-of-the-art baseline models without heavy engineering.
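The abstract does not say what the two ranking steps are, so the following is only an illustrative sketch of how a two-step ranking over recognized objects could look, not the authors' method. It assumes each candidate image already has a set of object labels from some recognizer. The names `Candidate`, `jaccard`, and `two_step_rank`, the shared-label count used in step 1, and the Jaccard score used in step 2 are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A previously seen image with its detected objects and attached story text."""
    image_id: str
    objects: frozenset  # labels from an object recognizer (assumed to be given)
    story: str


def jaccard(a, b):
    """Jaccard overlap between two label sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_step_rank(query_objects, candidates, shortlist_size=3, top_k=1):
    """Return the top_k candidates whose stories may serve as textual evidence.

    Both scoring functions are assumptions made for illustration only.
    """
    query = frozenset(query_objects)
    # Step 1 (coarse, assumed): keep the candidates sharing the most labels.
    shortlist = sorted(
        candidates, key=lambda c: len(query & c.objects), reverse=True
    )[:shortlist_size]
    # Step 2 (fine, assumed): rerank the shortlist by Jaccard overlap,
    # which penalizes candidates with many unrelated objects.
    reranked = sorted(
        shortlist, key=lambda c: jaccard(query, c.objects), reverse=True
    )
    return reranked[:top_k]


candidates = [
    Candidate("a", frozenset({"dog", "ball", "grass", "tree", "car", "house", "sky"}),
              "A busy street scene with a dog."),
    Candidate("b", frozenset({"dog", "ball"}), "The dog chased the ball."),
    Candidate("c", frozenset({"cat"}), "The cat slept all day."),
    Candidate("d", frozenset({"dog", "ball", "grass", "person"}),
              "We played fetch in the park."),
]

best = two_step_rank({"dog", "ball", "grass"}, candidates, shortlist_size=2, top_k=1)
print(best[0].image_id, "|", best[0].story)
```

In this toy data, the coarse step shortlists the two candidates that share all three query labels, and the fine step prefers the one with fewer unrelated objects. The retrieved story text would then feed the textual channel of the encoder.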