LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
This benchmark gives AI researchers a way to assess long-context imitation learning. The contribution is incremental in that it focuses on evaluation rather than on new methods.
The paper introduces LMAct, a benchmark to evaluate frontier models' ability to learn from long multimodal demonstrations (up to one million tokens) across tasks like tic-tac-toe and Atari, finding that models rarely reach expert performance and often show little improvement with more demonstrations.
In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context, from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
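The zero-, few-, and many-shot sweep described in the abstract can be pictured with a minimal sketch. This is not the benchmark's actual code: the names `build_prompt`, `demo_counts`, and `sweep`, the text-only prompt layout, and the powers-of-two spacing of demonstration counts are all assumptions made for illustration. Only the range (from no demonstrations to 512 full episodes) comes from the abstract.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: an episode is a list of (observation, action) text pairs.
Episode = List[Tuple[str, str]]


def build_prompt(demos: Sequence[Episode], observation: str) -> str:
    """Concatenate expert episodes, then append the current observation."""
    parts: List[str] = []
    for i, episode in enumerate(demos):
        parts.append(f"Demonstration {i + 1}:")
        for obs, action in episode:
            parts.append(f"Observation: {obs}")
            parts.append(f"Action: {action}")
    parts.append("Current observation: " + observation)
    parts.append("Action:")
    return "\n".join(parts)


def demo_counts(max_demos: int = 512) -> List[int]:
    """Zero-shot plus powers of two up to max_demos (the spacing is an assumption)."""
    counts = [0]
    k = 1
    while k <= max_demos:
        counts.append(k)
        k *= 2
    return counts


def sweep(
    policy: Callable[[str], str],
    demos: Sequence[Episode],
    observation: str,
    max_demos: int = 512,
) -> Dict[int, str]:
    """Query the policy once per demonstration count; returns {count: action}."""
    results: Dict[int, str] = {}
    for k in demo_counts(max_demos):
        if k > len(demos):
            break  # not enough expert episodes available for this count
        results[k] = policy(build_prompt(demos[:k], observation))
    return results
```

Here `policy` stands in for any function that maps a prompt string to an action string, such as a wrapper around a model API. In a real evaluation the returned action would be fed to the environment and the episode return compared against the expert's, which this sketch omits.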