Learning to play: A Multimodal Agent for 3D Game-Play
This paper addresses the challenge of developing AI agents for interactive 3D gaming environments. Its contribution is incremental, however, since it builds on existing behavior cloning methods and adds a custom architecture.
The paper tackles real-time multimodal reasoning in 3D first-person video games by training a text-conditioned agent with behavior cloning on a large, diverse dataset of human gameplay. The resulting model can play a variety of games and respond to text input in real time on a consumer GPU.
We argue that 3-D first-person video games are a challenging environment for real-time multi-modal reasoning. We first describe our dataset of human game-play, collected across a large variety of 3-D first-person games, which is both substantially larger and more diverse than prior publicly disclosed datasets and contains text instructions. We demonstrate that we can learn an inverse dynamics model from this dataset, which allows us to impute actions on a much larger dataset of publicly available videos of human game-play that lack recorded actions. We then train a text-conditioned agent for game playing using behavior cloning, with a custom architecture capable of real-time inference on a consumer GPU. We show that the resulting model is capable of playing a variety of 3-D games and responding to text input. Finally, we outline some of the remaining challenges, such as long-horizon tasks and quantitative evaluation across a large set of games.
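
The pipeline in the abstract (fit an inverse dynamics model on action-labeled data, use it to impute actions for action-free video, then train a text-conditioned policy by behavior cloning) can be illustrated with a toy tabular sketch. This is not the paper's architecture or data: the 1-D integer "states", the action names `left`/`right`/`noop`, the instruction strings, and the helper names `fit_idm`, `impute_actions`, and `fit_policy` are all assumptions made for illustration, and the policy here conditions only on the text instruction, ignoring the observation, to keep the example short.

```python
from collections import Counter, defaultdict


def fit_idm(labeled):
    """Fit a toy inverse dynamics model from (state, next_state, action) triples.

    The only feature used is the state change next_state - state; for each
    observed change we keep the most frequent action.
    """
    counts = defaultdict(Counter)
    for state, next_state, action in labeled:
        counts[next_state - state][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in counts.items()}


def impute_actions(idm, states):
    """Label an action-free trajectory with the actions the IDM predicts."""
    return [idm[nxt - cur] for cur, nxt in zip(states, states[1:])]


def fit_policy(demos):
    """Toy behavior cloning: per instruction, pick the most frequent action.

    For a tabular categorical policy this is the maximum-likelihood fit
    to the (instruction, action) demonstrations.
    """
    counts = defaultdict(Counter)
    for instruction, action in demos:
        counts[instruction][action] += 1
    return {text: c.most_common(1)[0][0] for text, c in counts.items()}


# Small action-labeled dataset (hypothetical).
labeled = [
    (0, 1, "right"),
    (1, 2, "right"),
    (2, 1, "left"),
    (3, 2, "left"),
    (1, 1, "noop"),
]
idm = fit_idm(labeled)

# Larger "video" dataset: instructions and states only, no recorded actions.
videos = [
    ("go right", [0, 1, 2, 3, 3, 4]),
    ("go left", [5, 4, 3, 3, 2]),
]

# Impute actions with the IDM, then clone behavior from the imputed labels.
demos = [
    (instruction, action)
    for instruction, states in videos
    for action in impute_actions(idm, states)
]
policy = fit_policy(demos)
```

In a realistic setting both the inverse dynamics model and the policy would be neural networks over video frames, and the policy would condition on recent observations as well as the text; the sketch only shows how imputed labels let action-free video feed the behavior cloning step.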