AI, LG | Jun 6, 2023

Agents Explore the Environment Beyond Good Actions to Improve Their Model for Better Decisions

arXiv:2306.03408v12.1 | h-index: 2 | Has Code

Originality: Synthesis-oriented

AI Analysis

This is an incremental improvement for AI agents in reinforcement learning, specifically addressing planning failures in MuZero-like systems.

The paper addresses poor decision-making in agents whose model predictions are inaccurate. It introduces an exploration method that combines normal planning with random deviations from the planned policy to improve value learning, and it illustrates the resulting gains in decision-making on Tic-Tac-Toe.

Improving the decision-making capabilities of agents is a key challenge on the road to artificial intelligence. To improve the planning skills needed to make good decisions, MuZero's agent combines prediction by a network model with planning by a tree search that uses those predictions. MuZero's learning process can fail when predictions are poor but planning requires them. We use this as an impetus to get the agent to explore parts of the decision tree in the environment that it otherwise would not explore. The agent achieves this in three steps. First, it plans normally to come up with an improved policy. Second, it randomly deviates from this policy at the beginning of each training episode. Third, it switches back to the improved policy at a random time step to experience the rewards from the environment associated with the improved policy, which is the basis for learning the correct value expectation. The simple board game Tic-Tac-Toe is used to illustrate how this approach can improve the agent's decision-making ability. The source code, written entirely in Java, is available at https://github.com/enpasos/muzero.
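The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and is not taken from the enpasos/muzero repository: the class `ExplorationSketch`, the methods `drawSwitchStep` and `selectAction`, the uniform choice of the switch step, and the uniform random deviation over legal actions are all assumptions made for the example. The improved policy is assumed to come from the agent's normal planning and is passed in as a plain probability array.

```java
import java.util.Random;

/** Hypothetical sketch of the exploration schedule; names and distributions are assumptions. */
final class ExplorationSketch {

    /** Draws the time step at which the agent returns to the improved policy (uniform in [0, maxSteps]). */
    static int drawSwitchStep(Random rng, int maxSteps) {
        return rng.nextInt(maxSteps + 1);
    }

    /** Picks one legal action uniformly at random (assumed form of the "random deviation"). */
    static int uniformLegal(boolean[] legal, Random rng) {
        int count = 0;
        for (boolean b : legal) {
            if (b) count++;
        }
        if (count == 0) throw new IllegalArgumentException("no legal action");
        int pick = rng.nextInt(count);
        for (int a = 0; a < legal.length; a++) {
            if (legal[a] && pick-- == 0) return a;
        }
        throw new IllegalStateException("unreachable");
    }

    /**
     * Chooses an action at time step t.
     * Before switchStep the agent deviates randomly; from switchStep on it
     * samples from the improved policy produced by planning, restricted to legal actions.
     */
    static int selectAction(int t, int switchStep, double[] improvedPolicy,
                            boolean[] legal, Random rng) {
        if (t < switchStep) {
            return uniformLegal(legal, rng);
        }
        double total = 0.0;
        for (int a = 0; a < legal.length; a++) {
            if (legal[a]) total += improvedPolicy[a];
        }
        if (total <= 0.0) {
            return uniformLegal(legal, rng); // policy has no mass on legal actions
        }
        double r = rng.nextDouble() * total;
        int last = -1;
        for (int a = 0; a < legal.length; a++) {
            if (!legal[a]) continue;
            last = a;
            r -= improvedPolicy[a];
            if (r < 0) return a;
        }
        return last; // guards against floating-point rounding
    }
}
```

For Tic-Tac-Toe, an episode would call `drawSwitchStep(rng, 9)` once at the start and then `selectAction` at every move with the current planning result. Only the moves from the switch step onward follow the improved policy, so under this reading those are the moves whose observed rewards inform the value target.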
