ARCON: Advancing Auto-Regressive Continuation for Driving Videos
This work addresses video prediction for autonomous driving systems, representing an incremental improvement in applying large vision models to this domain.
The paper tackles video continuation for autonomous driving by introducing ARCON, a scheme that alternates between generating semantic and RGB tokens. The scheme yields consistent long videos, and an optical flow-based texture stitching method enhances their visual quality.
Recent advancements in auto-regressive large language models (LLMs) have led to their application in video generation. This paper explores the use of Large Vision Models (LVMs) for video continuation, a task essential for building world models and predicting future frames. We introduce ARCON, a scheme that alternates between generating semantic and RGB tokens, allowing the LVM to explicitly learn high-level structural video information. We find that the generated RGB images and semantic maps are highly consistent, even without special design. Moreover, we employ an optical flow-based texture stitching method to enhance visual quality. Experiments in autonomous driving scenarios show that our model can consistently generate long videos.
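To make the alternating scheme concrete, the following is a minimal sketch of how semantic and RGB tokens might be interleaved and then continued auto-regressively. It is an illustration under stated assumptions, not the paper's implementation: the names `interleave`, `continue_video`, `next_token`, and `tokens_per_map` are hypothetical, the semantic-before-RGB ordering within a frame is assumed, and a toy predictor stands in for the LVM.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
Frame = List[Token]


def interleave(semantic: Sequence[Frame], rgb: Sequence[Frame]) -> List[Token]:
    """Flatten per-frame tokens as [sem_0, rgb_0, sem_1, rgb_1, ...].

    Assumption: each frame contributes its semantic tokens first, then its RGB tokens.
    """
    if len(semantic) != len(rgb):
        raise ValueError("need one semantic map per RGB frame")
    sequence: List[Token] = []
    for sem_tokens, rgb_tokens in zip(semantic, rgb):
        sequence.extend(sem_tokens)
        sequence.extend(rgb_tokens)
    return sequence


def continue_video(
    context: Sequence[Token],
    next_token: Callable[[Sequence[Token]], Token],
    num_new_frames: int,
    tokens_per_map: int,
) -> Tuple[List[Frame], List[Frame]]:
    """Generate new frames one token at a time, alternating semantic and RGB maps.

    `next_token` is a stand-in for the model: it maps the sequence so far to one token.
    Assumption: semantic and RGB maps use the same number of tokens per frame.
    """
    sequence = list(context)
    new_semantic: List[Frame] = []
    new_rgb: List[Frame] = []
    for _ in range(num_new_frames):
        # Semantic tokens come first, so the RGB tokens of the same frame
        # are conditioned on the structure that was just generated.
        for store in (new_semantic, new_rgb):
            frame: Frame = []
            for _ in range(tokens_per_map):
                token = next_token(sequence)
                sequence.append(token)
                frame.append(token)
            store.append(frame)
    return new_semantic, new_rgb


def toy_next_token(sequence: Sequence[Token]) -> Token:
    """Toy predictor (not a real model): previous token plus one, modulo 10."""
    return (sequence[-1] + 1) % 10


context = interleave(semantic=[[1, 2]], rgb=[[3, 4]])
new_semantic, new_rgb = continue_video(
    context, toy_next_token, num_new_frames=2, tokens_per_map=2
)
```

In this sketch the only structural commitment is the ordering of the token stream; swapping `toy_next_token` for a real model call would leave the alternation logic unchanged.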