Sort Story: Sorting Jumbled Images and Captions into Stories
The work targets temporal common sense, which is relevant to AI tasks such as QA and summarization, but the contribution is incremental because it builds on existing sequencing approaches.
The paper tackles the problem of sequencing jumbled image-caption pairs into coherent stories, achieving strong results through ensemble-based methods that combine unary and pairwise predictions.
Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing: given a jumbled set of aligned image-caption pairs that belong to a story, sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. We use both text-based and image-based features, which yield complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.
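
To make the idea of combining unary (position) and pairwise (order) predictions concrete, here is a minimal sketch in Python. It is not the method from the paper: the function name `best_order`, the linear `weight` between the two score types, the brute-force search over permutations, and all score values are assumptions chosen for illustration only.

```python
from itertools import permutations


def best_order(unary, pairwise, weight=0.5):
    """Return the ordering with the highest combined score.

    unary[i][p]    -- hypothetical score for placing item i at position p
    pairwise[i][j] -- hypothetical score that item i comes before item j
    weight         -- assumed mixing weight between the two score types
    """
    n = len(unary)

    def score(perm):
        # perm[p] is the item placed at position p.
        unary_total = sum(unary[item][pos] for pos, item in enumerate(perm))
        # Sum the "comes before" score over every ordered pair in perm.
        pairwise_total = sum(
            pairwise[perm[a]][perm[b]]
            for a in range(n)
            for b in range(a + 1, n)
        )
        return weight * unary_total + (1 - weight) * pairwise_total

    # Exhaustive search; only practical for short stories.
    return max(permutations(range(n)), key=score)


# Made-up scores for three jumbled image-caption pairs.
unary_scores = [
    [0.1, 0.2, 0.7],  # item 0 most likely last
    [0.8, 0.1, 0.1],  # item 1 most likely first
    [0.1, 0.7, 0.2],  # item 2 most likely in the middle
]
pairwise_scores = [
    [0.0, 0.1, 0.2],
    [0.9, 0.0, 0.9],
    [0.8, 0.1, 0.0],
]

print(best_order(unary_scores, pairwise_scores))  # (1, 2, 0)
```

In this toy case both score types agree, so the combined ordering is item 1, then item 2, then item 0. When they disagree, the assumed `weight` decides how much the position scores count relative to the order scores.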