CV | Nov 23, 2022

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

arXiv:2211.13319v3 | 27.498 citations | h-index: 58 | Has Code

Originality: Highly original

AI Analysis

The paper addresses the problem of maintaining actor and background consistency in multi-sentence story generation, proposing a novel method for a known bottleneck in AI and creative applications.

The paper tackles story visualization, where generated images must stay consistent across frames despite ambiguous references in the text. It proposes a novel autoregressive diffusion framework with a visual memory module that outperforms prior state-of-the-art methods on the MUGEN, PororoSV, and FlintstonesSV datasets in visual quality and consistency.

There has been a recent explosion of impressive generative models that can produce high-quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and the main actors in them. Employing such models for the more complex task of story visualization therefore remains a challenge: references and co-references occur naturally, and one must reason, based on story progression, about when to maintain consistency of actors and backgrounds across frames/scenes and when not to. In this work, we address these challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds, and referencing in multi-sentence storylines. Our experiments on story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality that are consistent with the story, but also models appropriate correspondences between the characters and the background.
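The sentence-conditioned soft attention over a visual memory mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `attend_memory`, the choice of past sentence embeddings as keys and past frame features as values, the scaled dot-product scoring, and all dimensions are assumptions made for illustration, and a random vector stands in for the diffusion model's output.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def attend_memory(sentence_emb, memory_keys, memory_values):
    """Soft attention of the current sentence over stored memories.

    sentence_emb:  (d,)     query for the current sentence (assumed)
    memory_keys:   (n, d)   one key per previously generated frame (assumed)
    memory_values: (n, d_v) one visual feature per previous frame (assumed)
    Returns the attended context (d_v,) and the attention weights (n,).
    """
    d = sentence_emb.shape[0]
    scores = memory_keys @ sentence_emb / np.sqrt(d)  # scaled dot product
    weights = softmax(scores)                         # soft, sums to 1
    context = weights @ memory_values                 # weighted memory read
    return context, weights


# Autoregressive loop: each frame is conditioned on its sentence plus a
# context read from the memory of earlier frames.
rng = np.random.default_rng(0)
d, d_v = 8, 4
keys, values, contexts = [], [], []
for t in range(3):
    sentence_emb = rng.normal(size=d)  # stand-in for a text encoder output
    if keys:
        context, _ = attend_memory(sentence_emb, np.stack(keys), np.stack(values))
    else:
        context = np.zeros(d_v)        # first frame: memory is empty
    frame_feat = rng.normal(size=d_v)  # stand-in for the generated frame's features
    keys.append(sentence_emb)          # write the new frame into memory
    values.append(frame_feat)
    contexts.append(context)
```

Because the weights are soft rather than hard, a sentence with an ambiguous reference such as "she" can draw mostly on the earlier frame whose key it matches best, while a sentence that introduces a new scene can spread its attention thinly and rely less on any one memory.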
