AIDec 4, 2025
SIMA 2: A Generalist Embodied Agent for Virtual WorldsSIMA team, Adrian Bolton, Alexander Lerchner et al.
We introduce SIMA 2, a generalist embodied agent that understands and acts in a wide variety of 3D virtual worlds. Built upon a Gemini foundation model, SIMA 2 represents a significant step toward active, goal-directed interaction within an embodied environment. Unlike prior work (e.g., SIMA 1) limited to simple language commands, SIMA 2 acts as an interactive partner, capable of reasoning about high-level goals, conversing with the user, and handling complex instructions given through language and images. Across a diverse portfolio of games, SIMA 2 substantially closes the gap with human performance and demonstrates robust generalization to previously unseen environments, all while retaining the base model's core reasoning capabilities. Furthermore, we demonstrate a capacity for open-ended self-improvement: by leveraging Gemini to generate tasks and provide rewards, SIMA 2 can autonomously learn new skills from scratch in a new environment. This work validates a path toward creating versatile and continuously learning agents for both virtual and, eventually, physical worlds.
ROMar 13, 2024
Scaling Instructable Agents Across Many Simulated WorldsSIMA Team, Maria Abi Raad, Arun Ahuja et al. · deepmind, stanford
Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
LGFeb 7, 2024
Amortized Planning with Large-Scale Transformers: A Case Study on ChessAnian Ruoss, Grégoire Delétang, Sourabh Medapati et al.
This paper uses chess, a landmark planning problem in AI, to assess transformers' performance on a planning task where memorization is futile $\unicode{x2013}$ even at a large scale. To this end, we release ChessBench, a large-scale benchmark dataset of 10 million chess games with legal move and value annotations (15 billion data points) provided by Stockfish 16, the state-of-the-art chess engine. We train transformers with up to 270 million parameters on ChessBench via supervised learning and perform extensive ablations to assess the impact of dataset size, model size, architecture type, and different prediction targets (state-values, action-values, and behavioral cloning). Our largest models learn to predict action-values for novel boards quite accurately, implying highly non-trivial generalization. Despite performing no explicit search, our resulting chess policy solves challenging chess puzzles and achieves a surprisingly strong Lichess blitz Elo of 2895 against humans (grandmaster level). We also compare to Leela Chess Zero and AlphaZero (trained without supervision via self-play) with and without search. We show that, although a remarkably good approximation of Stockfish's search-based algorithm can be distilled into large-scale transformers via supervised learning, perfect distillation is still beyond reach, thus making ChessBench well-suited for future research.
LGDec 9, 2024
Can foundation models actively gather information in interactive environments to test hypotheses?Danny P. Sawyer, Nan Rosemary Ke, Hubert Soyer et al. · deepmind
Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their ability to learn from experience, adapt, and gather information. First, in "Feature World," a simple setting for testing information gathering, models performed near-optimally. However, to test more complex, multi-trial learning, we implemented a text-based version of the "Alchemy" environment, a benchmark for meta-learning. Here, agents must deduce a latent causal structure by integrating information across many trials. In this setting, recent foundation models initially failed to improve their performance over time. Crucially, we found that prompting the models to summarize their observations at regular intervals enabled an emergent meta-learning process. This allowed them to improve across trials and even adaptively re-learn when the environment's rules changed unexpectedly. While most models handled the simple task, Alchemy revealed stark differences in robustness: Gemini 2.5 performed best, followed by Claude 3.7, while ChatGPT-4o and o4-mini struggled. This underscores Alchemy's value as a benchmark. Our findings demonstrate that the biggest challenge for foundation models is not selecting informative actions in the moment, but integrating knowledge through adaptive strategies over time. Encouragingly, there appears to be no intrinsic barrier to future models mastering these abilities.