Diffusion Models Are Real-Time Game Engines
GameNGen enables real-time, high-quality simulation of complex environments for gaming and interactive applications. It represents a novel application of diffusion models rather than a foundational advance.
The authors address the problem of simulating an interactive game environment in real time with a neural model. The resulting system, GameNGen, runs at 20 frames per second on a single TPU and achieves a PSNR of 29.4 for next frame prediction. Human raters are only slightly better than random chance at distinguishing its output from real game clips.
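The PSNR figure can be made concrete with the standard definition, PSNR = 10 * log10(MAX^2 / MSE). The following is a minimal sketch, not the authors' evaluation code. The function names `psnr` and `rmse_from_psnr` are illustrative, frames are represented as flat lists of pixel values, and an 8-bit pixel range (MAX = 255) is an assumption that the text above does not state.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], prediction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    max_value=255 assumes 8-bit pixels (an assumption, not from the source).
    """
    if len(reference) != len(prediction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_from_psnr(psnr_db: float, max_value: float = 255.0) -> float:
    """Invert the PSNR definition to recover the root-mean-square pixel error."""
    return max_value / (10.0 ** (psnr_db / 20.0))
```

Under the 8-bit assumption, `rmse_from_psnr(29.4)` comes out at roughly 8.6 intensity levels out of 255, which gives a sense of the per-pixel error behind the reported score.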
We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.
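The auto-regressive generation described in the abstract, where each new frame is produced from past frames and actions and then fed back in as context, can be sketched as a simple loop. This is an illustrative sketch only. The `rollout` function, the `predict_next` callable (a stand-in for the trained diffusion model), the `context_len` parameter, and the frame representation are all hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # hypothetical stand-in for an image (flattened pixels)
Action = int         # hypothetical stand-in for a player input

# Stand-in signature for the trained model: (past frames, past actions) -> next frame.
Predictor = Callable[[List[Frame], List[Action]], Frame]


def rollout(predict_next: Predictor, initial_frames: Sequence[Frame],
            actions: Sequence[Action], context_len: int) -> List[Frame]:
    """Generate one frame per action, feeding each output back in as context."""
    # Bounded buffers: only the most recent context_len items are kept.
    frames: Deque[Frame] = deque(initial_frames, maxlen=context_len)
    past_actions: Deque[Action] = deque(maxlen=context_len)
    generated: List[Frame] = []
    for action in actions:
        past_actions.append(action)
        next_frame = predict_next(list(frames), list(past_actions))
        generated.append(next_frame)
        frames.append(next_frame)  # the model's own output becomes context
    return generated


def toy_predictor(frames: List[Frame], actions: List[Action]) -> Frame:
    """Toy model for demonstration: shift the last frame by the latest action."""
    return [pixel + actions[-1] for pixel in frames[-1]]


clip = rollout(toy_predictor, [[0.0]], [1, 2, 3], context_len=4)
```

Because every generated frame re-enters the context, small prediction errors can accumulate over a long trajectory. The abstract names conditioning augmentations as the measure that keeps this loop stable.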