Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
The work addresses the need for context-rich descriptions of future events in applications such as healthcare and smart homes, and it defines a new task that extends beyond traditional action anticipation.
The paper tackles the problem of generating detailed narrations of future daily activities from videos. It proposes a visual-language model, ViNa, that predicts sequences of future narrations over extended time horizons, and it evaluates the model on the Ego4D dataset.
Anticipating future events is crucial in application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information that enhances a system's future planning and decision-making capabilities. We propose a novel task, $\textit{long-term future narration generation}$, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and their corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. Whereas existing multimodal models only make short-term predictions or describe observed videos, ViNa generates long-term future narrations for a broader range of daily activities. We also present a novel downstream application, future video retrieval, which leverages the generated narrations to help users plan a task by visualizing the future. We evaluate future narration generation on Ego4D, the largest egocentric dataset.