LG | Mar 16

GASP: Guided Asymmetric Self-Play For Coding LLMs

arXiv:2603.15957 | 91.23 citations | h-index: 10
AI Analysis

The paper addresses how to make self-play more effective at improving the coding capabilities of LLMs. The contribution appears incremental, since it builds on existing asymmetric self-play methods.

The paper targets a weakness of unguided asymmetric self-play for post-training coding LLMs: not every hard problem is informative. It proposes Guided Asymmetric Self-Play (GASP), which uses goalpost questions drawn from real data to guide training. GASP improves pass@20 on LiveCodeBench by 2.5% over unguided asymmetric self-play and solves hard goalpost questions that remain out of reach for all baselines.
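For readers unfamiliar with the pass@20 metric quoted above, the sketch below shows one common way pass@k is computed: the unbiased estimator popularized by the HumanEval benchmark. Whether GASP uses this exact estimator is an assumption; the summary does not say. The function name `pass_at_k` and its parameters are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: budget of samples the metric allows (k = 20 for pass@20)

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute it differently.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 whenever every size-k subset must contain a pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct sample out of 40, with a budget of 20 draws.
    print(pass_at_k(n=40, c=1, k=20))  # 0.5
```

A benchmark-level score is then the mean of this value over all problems.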

Asymmetric self-play has emerged as a promising paradigm for post-training large language models, where a teacher continually generates questions for a student to solve at the edge of the student's learnability. Although these methods promise open-ended data generation bootstrapped from no human data, they suffer from one major problem: not all problems that are hard to solve are interesting or informative for improving the overall capabilities of the model. Current asymmetric self-play methods are goal-agnostic with no real grounding. We propose Guided Asymmetric Self-Play (GASP), where grounding is provided by real-data goalpost questions that are identified to pose a hard exploration challenge to the model. During self-play, the teacher first generates an easier variant of a hard question, and then a harder variant of that easier question, with the goal of gradually closing the gap to the goalpost throughout training. In doing so, we improve pass@20 on LiveCodeBench (LCB) by 2.5% over unguided asymmetric self-play, and through the curriculum constructed by the teacher, we manage to solve hard goalpost questions that remain out of reach for all baselines.
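The easier-then-harder curriculum described in the abstract can be illustrated with a toy simulation. This is only a sketch under heavy assumptions, not the authors' method: questions are reduced to a scalar difficulty, the teacher and student are simple random functions rather than LLMs, and the update rule stands in for whatever training GASP actually performs. All names (`Question`, `teacher_easier`, `teacher_harder`, `student_solves`, `run_guided_self_play`) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    difficulty: float  # toy scalar stand-in for a coding problem (assumption)


def teacher_easier(goalpost: Question, skill: float, rng: random.Random) -> Question:
    """Propose an easier variant of the goalpost, near the student's skill."""
    base = min(skill, goalpost.difficulty)
    step = rng.uniform(0.0, 0.5) * (goalpost.difficulty - base)
    return Question(base + step)


def teacher_harder(easier: Question, goalpost: Question, rng: random.Random) -> Question:
    """Propose a harder variant of the easier question, moving toward the goalpost."""
    step = rng.uniform(0.0, 0.5) * (goalpost.difficulty - easier.difficulty)
    return Question(easier.difficulty + step)


def student_solves(skill: float, q: Question, rng: random.Random) -> bool:
    """Toy student: success probability falls linearly with the gap above skill."""
    gap = q.difficulty - skill
    p = 1.0 if gap <= 0 else max(0.0, 1.0 - gap)
    return rng.random() < p


def run_guided_self_play(goalpost: Question, skill: float, steps: int, seed: int) -> list[float]:
    """Return the student's skill after each step (index 0 is the initial skill)."""
    rng = random.Random(seed)
    trajectory = [skill]
    for _ in range(steps):
        easier = teacher_easier(goalpost, skill, rng)
        harder = teacher_harder(easier, goalpost, rng)
        for q in (easier, harder):
            # Toy learning rule (assumption): solving a question above the
            # current skill lifts the skill to that question's difficulty.
            if q.difficulty > skill and student_solves(skill, q, rng):
                skill = q.difficulty
        trajectory.append(skill)
    return trajectory


if __name__ == "__main__":
    traj = run_guided_self_play(Question(3.0), skill=0.0, steps=200, seed=0)
    print(f"initial skill {traj[0]:.2f}, final skill {traj[-1]:.2f}, goalpost 3.00")
```

In this toy setting the goalpost at difficulty 3.0 has zero success probability for a student starting at skill 0.0, so it cannot be solved directly; the intermediate questions are what let the skill move toward it, which mirrors the role the abstract assigns to the teacher's curriculum.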

