LG ML · Dec 9, 2024

Can foundation models actively gather information in interactive environments to test hypotheses?

Danny P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang

DeepMind

arXiv:2412.06438v2 · 13.47 citations · h-index: 69

Originality: Incremental advance

AI Analysis

This work addresses the challenge of enabling foundation models to integrate knowledge over time, a requirement for real-world interactive applications. The advance is incremental, however, because the improvement comes from prompting rather than architectural changes.

The paper examines why foundation models struggle with multi-turn exploration in dynamic environments. The models performed well on simple information-gathering tasks but initially failed at complex meta-learning in the Alchemy benchmark. Prompting them to summarize their observations enabled emergent meta-learning and adaptation to rule changes.

Foundation models excel at single-turn reasoning but struggle with multi-turn exploration in dynamic environments, a requirement for many real-world challenges. We evaluated these models on their ability to learn from experience, adapt, and gather information. First, in "Feature World," a simple setting for testing information gathering, models performed near-optimally. However, to test more complex, multi-trial learning, we implemented a text-based version of the "Alchemy" environment, a benchmark for meta-learning. Here, agents must deduce a latent causal structure by integrating information across many trials. In this setting, recent foundation models initially failed to improve their performance over time. Crucially, we found that prompting the models to summarize their observations at regular intervals enabled an emergent meta-learning process. This allowed them to improve across trials and even adaptively re-learn when the environment's rules changed unexpectedly. While most models handled the simple task, Alchemy revealed stark differences in robustness: Gemini 2.5 performed best, followed by Claude 3.7, while ChatGPT-4o and o4-mini struggled. This underscores Alchemy's value as a benchmark. Our findings demonstrate that the biggest challenge for foundation models is not selecting informative actions in the moment, but integrating knowledge through adaptive strategies over time. Encouragingly, there appears to be no intrinsic barrier to future models mastering these abilities.
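The summarization prompting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_trials`, its parameters `model`, `env_step`, and `summarize_every`, and the prompt wording are all hypothetical, chosen only to show the general idea of asking a model to condense its observations at regular intervals and carrying that summary forward.

```python
from typing import Callable, List


def run_trials(
    model: Callable[[str], str],      # hypothetical: maps a prompt to a text reply
    env_step: Callable[[str], str],   # hypothetical: maps an action to an observation
    num_steps: int,
    summarize_every: int,
) -> List[str]:
    """Run an interaction loop, asking the model for a summary every few steps."""
    summary = ""              # running summary carried across steps
    recent: List[str] = []    # action/observation pairs since the last summary
    summaries: List[str] = []

    for step in range(1, num_steps + 1):
        # The action prompt includes the current summary plus recent history.
        prompt = (
            f"Summary so far: {summary}\n"
            f"Recent observations: {recent}\n"
            "Choose the next action."
        )
        action = model(prompt)
        observation = env_step(action)
        recent.append(f"{action} -> {observation}")

        # At regular intervals, ask the model to fold recent history
        # into an updated summary (assumed prompt wording).
        if step % summarize_every == 0:
            summary = model(
                "Summarize what you have learned.\n"
                f"Previous summary: {summary}\n"
                f"New observations: {recent}"
            )
            summaries.append(summary)
            recent = []  # assumption: raw history is dropped once summarized

    return summaries
```

Clearing the raw history after each summary is a design choice made here for brevity; the abstract does not say whether the full interaction history was kept in context alongside the summaries.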
