Philip Bontrager

h-index: 10

Papers: 12

Citations: 1,016

Novelty: 42%

AI Score: 41

Ranked #68,339 of 194,257 authors (top 35%); #15,370 in LG (top 38%)

12 Papers

11.1 | AI | Aug 11, 2025

GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games

Yuchen Li, Cong Lin, Muhammad Umair Nasir et al.

We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a game description language that enables rapid creation of new games and levels, helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including the meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across a broad set of games and levels with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. While these interventions lead to partial improvements, the benchmark remains very far from solved. GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and contextual reasoning.

19.2 | LG | May 6, 2021 | Code

Learning Controllable Content Generators

Sam Earle, Maria Edwards, Ahmed Khalifa et al.

It has recently been shown that reinforcement learning can be used to train generators capable of producing high-quality game levels, with quality defined in terms of some user-specified heuristic. To ensure that these generators' output is sufficiently diverse (that is, not amounting to the reproduction of a single optimal level configuration), the generation process is constrained such that the initial seed results in some variance in the generator's output. However, this results in a loss of control over the generated content for the human user. We propose to train generators capable of producing controllably diverse output, by making them "goal-aware." To this end, we add conditional inputs representing how close a generator is to some heuristic, and also modify the reward mechanism to incorporate that value. Testing on multiple domains, we show that the resulting level generators are capable of exploring the space of possible levels in a targeted, controllable manner, producing levels that are comparable in quality to those of their goal-unaware counterparts and diverse along designer-specified dimensions.

6.1 | AI | Feb 20, 2021

Game Mechanic Alignment Theory and Discovery

Michael Cerny Green, Ahmed Khalifa, Philip Bontrager et al.

We present a new concept called Game Mechanic Alignment theory as a way to organize game mechanics through the lens of systemic rewards and agential motivations. By disentangling player and systemic influences, mechanics may be better identified for use in an automated tutorial generation system, which could tailor tutorials for a particular playstyle or player. In this paper, we apply this theory to several well-known games to demonstrate how designers can benefit from it, describe a methodology for estimating "mechanic alignment", and apply this methodology to multiple games in the GVGAI framework. We discuss how effectively this estimation captures agential motivations and systemic rewards and how our theory could be used as an alternative way to find mechanics for tutorial generation.

23.4 | AI | Feb 12, 2020 | Code

Learning to Generate Levels From Nothing

Philip Bontrager, Julian Togelius

Machine learning for procedural content generation has recently become an active area of research. Levels vary in both form and function and are mostly unrelated to each other across games. This has made it difficult to assemble suitably large datasets to bring machine learning to level design in the same way as it has been used for image generation. Here we propose Generative Playing Networks, which design levels for themselves to play. The algorithm is built in two parts: an agent that learns to play game levels, and a generator that learns the distribution of playable levels. As the agent learns and improves its ability, the space of playable levels, as defined by the agent, grows. The generator targets the agent's playability estimates to then update its understanding of what constitutes a playable level. We call this process of learning the distribution of data found through self-discovery with an environment self-supervised inductive learning. Unlike previous approaches to procedural content generation, Generative Playing Networks are end-to-end differentiable and do not require human-designed examples or domain knowledge. We demonstrate the capability of this framework by training an agent and level generator for a 2D dungeon crawler game.

17.4 | LG | Jan 27, 2020 | Code

Rotation, Translation, and Cropping for Zero-Shot Generalization

Chang Ye, Ahmed Khalifa, Philip Bontrager et al.

Deep Reinforcement Learning (DRL) has shown impressive performance on domains with visual inputs, in particular various games. However, the agent is usually trained on a fixed environment, e.g. a fixed number of levels. A growing body of evidence suggests that these trained models fail to generalize to even slight variations of the environments they were trained on. This paper advances the hypothesis that the lack of generalization is partly due to the input representation, and explores how rotation, cropping and translation could increase generality. We show that a cropped, translated and rotated observation can achieve better generalization on unseen levels of two-dimensional arcade games from the GVGAI framework. The generality of the agents is evaluated on both human-designed and procedurally generated levels.
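As a rough illustration of what a cropped, translated and rotated observation might look like on a tile grid, here is a minimal sketch. It is not the paper's implementation: the function names, the padding tile `#`, the window radius, and the convention for mapping a facing direction to quarter turns are all assumptions made for this example.

```python
from typing import List

Grid = List[List[str]]


def translate_and_crop(grid: Grid, agent_row: int, agent_col: int,
                       radius: int, pad: str = "#") -> Grid:
    """Return a (2*radius+1)-square window centred on the agent.

    Centring on the agent is the translation; the fixed window is the crop.
    Cells that fall outside the level are filled with `pad` (an assumption).
    """
    height, width = len(grid), len(grid[0])
    window = []
    for r in range(agent_row - radius, agent_row + radius + 1):
        row = []
        for c in range(agent_col - radius, agent_col + radius + 1):
            inside = 0 <= r < height and 0 <= c < width
            row.append(grid[r][c] if inside else pad)
        window.append(row)
    return window


def rotate_clockwise(window: Grid, quarter_turns: int) -> Grid:
    """Rotate a square window clockwise by quarter_turns * 90 degrees."""
    for _ in range(quarter_turns % 4):
        window = [list(row) for row in zip(*window[::-1])]
    return window


def egocentric_view(grid: Grid, agent_row: int, agent_col: int,
                    facing: str, radius: int = 1) -> Grid:
    """Crop around the agent, then rotate so its facing direction points up."""
    # Hypothetical convention: clockwise quarter turns that bring `facing` to the top.
    turns = {"up": 0, "left": 1, "down": 2, "right": 3}[facing]
    window = translate_and_crop(grid, agent_row, agent_col, radius)
    return rotate_clockwise(window, turns)
```

With this representation, an agent standing next to a goal tile sees that tile in the same relative cell regardless of where on the level it stands or which way it faces, which is one plausible reading of why such an input could generalize better.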

27.3 | LG | Jan 24, 2020 | Code

PCGRL: Procedural Content Generation via Reinforcement Learning

Ahmed Khalifa, Philip Bontrager, Sam Earle et al.

We investigate how reinforcement learning can be used to train level-designing agents. This represents a new approach to procedural content generation in games, where level design is framed as a game, and the content generator itself is learned. By seeing the design problem as a sequential task, we can use reinforcement learning to learn how to take the next action so that the expected final level quality is maximized. This approach can be used when few or no examples exist to train from, and the trained generator is very fast. We investigate three different ways of transforming two-dimensional level design problems into Markov decision processes and apply these to three game environments.
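To make the idea of framing level design as a sequential decision problem concrete, the following is a toy sketch of one possible formulation. It is not one of the three representations studied in the paper, and it does not use the authors' code: the class `LevelDesignEnv`, the fixed tile-visiting order, and the wall-count quality heuristic are all hypothetical choices made for illustration.

```python
import random


class LevelDesignEnv:
    """Toy MDP: the agent visits tiles in a fixed order and chooses each one."""

    TILES = (".", "#")  # empty floor and wall (hypothetical tile set)

    def __init__(self, width=4, height=4, target_walls=5, seed=0):
        self.width, self.height = width, height
        self.target_walls = target_walls
        self.rng = random.Random(seed)
        self.reset()

    def quality(self):
        """Hypothetical heuristic: closer to the target wall count is better."""
        walls = sum(row.count("#") for row in self.level)
        return -abs(walls - self.target_walls)

    def observe(self):
        """State = current level contents plus the index of the tile to edit."""
        return tuple("".join(row) for row in self.level), self.cursor

    def reset(self):
        """Start each episode from a random level."""
        self.level = [[self.rng.choice(self.TILES) for _ in range(self.width)]
                      for _ in range(self.height)]
        self.cursor = 0
        return self.observe()

    def step(self, action):
        """Place TILES[action] at the cursor; reward is the change in quality."""
        before = self.quality()
        row, col = divmod(self.cursor, self.width)
        self.level[row][col] = self.TILES[action]
        self.cursor += 1
        reward = self.quality() - before
        done = self.cursor >= self.width * self.height
        return self.observe(), reward, done
```

Because each reward is the change in quality, the rewards over an episode sum to the final quality minus the starting quality, so maximizing return corresponds to maximizing the quality of the finished level.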

7.1 | LG | Aug 12, 2019

Superstition in the Network: Deep Reinforcement Learning Plays Deceptive Games

Philip Bontrager, Ahmed Khalifa, Damien Anderson et al.

Deep reinforcement learning has learned to play many games well, but failed on others. To better characterize the modes and reasons of failure of deep reinforcement learners, we test the widely used Advantage Actor-Critic (A2C) algorithm on four deceptive games, which are specially designed to provide challenges to game-playing agents. These games are implemented in the General Video Game AI framework, which allows us to compare the behavior of reinforcement learning-based agents with planning agents based on tree search. We find that several of these games reliably deceive deep reinforcement learners, and that the resulting behavior highlights the shortcomings of the learning algorithm. The particular ways in which agents fail differ from how planning-based agents fail, further illuminating the character of these algorithms. We propose an initial typology of deceptions which could help us better understand pitfalls and failure modes of (deep) reinforcement learning.

27.4 | LG | Jun 28, 2018 | Code

Illuminating Generalization in Deep Reinforcement Learning through Procedural Level Generation

Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager et al.

Deep reinforcement learning (RL) has shown impressive results in a variety of domains, learning directly from high-dimensional sensory streams. However, when neural networks are trained in a fixed environment, such as a single level in a video game, they will usually overfit and fail to generalize to new levels. When RL models overfit, even slight modifications to the environment can result in poor agent performance. This paper explores how procedurally generated levels during training can increase generality. We show that for some games procedural level generation enables generalization to new levels within the same distribution. Additionally, it is possible to achieve better performance with less data by manipulating the difficulty of the levels in response to the performance of the agent. The generality of the learned behaviors is also evaluated on a set of human-designed levels. The results suggest that the ability to generalize to human-designed levels highly depends on the design of the level generators. We apply dimensionality reduction and clustering techniques to visualize the generators' distributions of levels and analyze to what degree they can produce levels similar to those designed by a human.
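The abstract does not say how level difficulty is adjusted in response to agent performance. One simple rule that fits the description, shown here purely as an assumption, is to nudge a difficulty value in [0, 1] up after a win and down after a loss. The function name and the step size are hypothetical.

```python
def update_difficulty(difficulty: float, won: bool, step: float = 0.01) -> float:
    """Raise difficulty after a win, lower it after a loss, clamped to [0, 1].

    This fixed-increment rule is an illustrative assumption, not a rule
    taken from the abstract.
    """
    change = step if won else -step
    return min(1.0, max(0.0, difficulty + change))


def run_curriculum(outcomes, start: float = 0.0, step: float = 0.01) -> float:
    """Apply the rule over a sequence of episode outcomes (True means a win)."""
    difficulty = start
    for won in outcomes:
        difficulty = update_difficulty(difficulty, won, step)
    return difficulty
```

In a training loop, the current difficulty value would be passed to the level generator before each episode, so the agent mostly sees levels near the edge of what it can already solve.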

18.1 | LG | Jun 6, 2018 | Code

Deep Reinforcement Learning for General Video Game AI

Ruben Rodriguez Torrado, Philip Bontrager, Julian Togelius et al.

The General Video Game AI (GVGAI) competition and its associated software framework provide a way of benchmarking AI algorithms on a large number of games written in a domain-specific description language. While the competition has seen plenty of interest, it has so far focused on online planning, providing a forward model that allows the use of algorithms such as Monte Carlo Tree Search. In this paper, we describe how we interface GVGAI to the OpenAI Gym environment, a widely used way of connecting agents to reinforcement learning problems. Using this interface, we characterize how widely used implementations of several deep reinforcement learning algorithms fare on a number of GVGAI games. We further analyze the results to provide a first indication of the difficulty of these games relative to each other and to those in the Arcade Learning Environment under similar conditions.

21.6 | NE | Jan 24, 2018

Deep Interactive Evolution

Philip Bontrager, Wending Lin, Julian Togelius et al.

This paper describes an approach that combines generative adversarial networks (GANs) with interactive evolutionary computation (IEC). While GANs can be trained to produce lifelike images, they are normally sampled randomly from the learned distribution, providing limited control over the resulting output. On the other hand, interactive evolution has shown promise in creating various artifacts such as images, music and 3D objects, but traditionally relies on a hand-designed evolvable representation of the target domain. The main insight in this paper is that a GAN trained on a specific target domain can act as a compact and robust genotype-to-phenotype mapping (i.e. most produced phenotypes do resemble valid domain artifacts). Once such a GAN is trained, the latent vector given as input to the GAN's generator network can be put under evolutionary control, allowing controllable and high-quality image generation. In this paper, we demonstrate the advantage of this novel approach through a user study in which participants were able to evolve images that strongly resemble specific target images.

29.7 | AI | Aug 25, 2017

Deep Learning for Video Game Playing

Niels Justesen, Philip Bontrager, Julian Togelius et al.

In this article, we review recent Deep Learning advances in the context of how they have been applied to play different types of video games such as first-person shooters, arcade games, and real-time strategy games. We analyze the unique requirements that different game genres pose to a deep learning system and highlight important open challenges in the context of applying these machine learning methods to video games, such as general game playing, dealing with extremely large decision spaces and sparse rewards.

19.7 | CV | May 21, 2017

DeepMasterPrints: Generating MasterPrints for Dictionary Attacks via Latent Variable Evolution

Philip Bontrager, Aditi Roy, Julian Togelius et al.

Recent research has demonstrated the vulnerability of fingerprint recognition systems to dictionary attacks based on MasterPrints. MasterPrints are real or synthetic fingerprints that can fortuitously match a large number of fingerprints, thereby undermining the security afforded by fingerprint systems. Previous work by Roy et al. generated synthetic MasterPrints at the feature-level. In this work we generate complete image-level MasterPrints known as DeepMasterPrints, whose attack accuracy is found to be far superior to that of previous methods. The proposed method, referred to as Latent Variable Evolution, is based on training a Generative Adversarial Network on a set of real fingerprint images. Stochastic search in the form of the Covariance Matrix Adaptation Evolution Strategy is then used to search for latent input variables to the generator network that can maximize the number of impostor matches as assessed by a fingerprint recognizer. Experiments convey the efficacy of the proposed method in generating DeepMasterPrints. The underlying method is likely to have broad applications in fingerprint security as well as fingerprint synthesis.
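The search loop behind Latent Variable Evolution can be sketched in a few lines. This sketch makes two deliberate substitutions: it uses a simple elitist (1+lambda) evolution strategy instead of the Covariance Matrix Adaptation Evolution Strategy named in the abstract, and it replaces the generator network and fingerprint recognizer with a toy fitness function. All names and parameter values here are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def toy_fitness(latent: Vector) -> float:
    """Stand-in for 'generate an image from the latent, then score it'.

    Assumption: in the real setting this would run the generator and count
    impostor matches; here it just rewards latents near an arbitrary point.
    """
    return -sum((x - 1.0) ** 2 for x in latent)


def evolve_latent(fitness: Callable[[Vector], float], start: Vector,
                  generations: int = 50, offspring: int = 20,
                  sigma: float = 0.3, seed: int = 0) -> Tuple[Vector, float]:
    """Elitist (1+lambda) evolution strategy over a latent vector."""
    rng = random.Random(seed)
    parent, parent_score = list(start), fitness(start)
    for _ in range(generations):
        # Mutate the parent with Gaussian noise to create the offspring.
        children = [[x + rng.gauss(0.0, sigma) for x in parent]
                    for _ in range(offspring)]
        best_child = max(children, key=fitness)
        best_score = fitness(best_child)
        # Keep the parent unless a child strictly improves on it.
        if best_score > parent_score:
            parent, parent_score = best_child, best_score
    return parent, parent_score
```

Calling `evolve_latent(toy_fitness, [0.0] * 8)` returns the best latent found and its score. Because the strategy is elitist, the returned score is never worse than the score of the starting vector.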