AI · LG · Apr 1

In harmony with gpt-oss

arXiv:2604.00362 · 67.6 · h-index: 9 · Has Code

Predicted impact: top 52% in AI · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the reproducibility gap in AI research by enabling independent verification of published model scores. It is incremental, since it replicates existing results rather than advancing new capabilities.

The authors independently reproduce OpenAI's published scores for gpt-oss-20b with tools by reverse-engineering the model's in-distribution tools and building a native agent harness. They report 60.4% on SWE Verified HIGH (published 60.7%) and 91.7% on AIME25 with tools (published 90.4%).

No one has independently reproduced OpenAI's published scores for gpt-oss-20b with tools, because the original paper discloses neither the tools nor the agent harness. We reverse-engineered the model's in-distribution tools: when prompted without tool definitions, gpt-oss still calls tools from its training distribution with high statistical confidence -- a strong prior, not a hallucination. We then built a native harmony agent harness (https://github.com/borislavmavrin/harmonyagent.git) that encodes messages in the model's native format, bypassing the lossy Chat Completions conversion. Together, these yield the first independent reproduction of OpenAI's published scores: 60.4% on SWE Verified HIGH (published 60.7%), 53.3% MEDIUM (53.2%), and 91.7% on AIME25 with tools (90.4%).
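To make the size of the reproduction gap concrete, the sketch below computes reproduced minus published for the three score pairs quoted in the abstract. The numbers come from the abstract; the label "SWE Verified MEDIUM" is an assumed expansion of the abstract's "53.3% MEDIUM", and the names `SCORES` and `score_gaps` are illustrative only, not part of the authors' harness.

```python
# Score pairs quoted in the abstract: (reproduced %, published %).
# "SWE Verified MEDIUM" is an assumed reading of the abstract's "53.3% MEDIUM".
SCORES = {
    "SWE Verified HIGH": (60.4, 60.7),
    "SWE Verified MEDIUM": (53.3, 53.2),
    "AIME25 with tools": (91.7, 90.4),
}


def score_gaps(scores):
    """Return reproduced minus published, in percentage points, per benchmark."""
    return {
        name: round(reproduced - published, 1)
        for name, (reproduced, published) in scores.items()
    }


if __name__ == "__main__":
    for name, gap in score_gaps(SCORES).items():
        # Signed gap, e.g. "-0.3 pp" means the reproduction is 0.3 points lower.
        print(f"{name}: {gap:+.1f} pp")
```

On these figures the gaps are -0.3, +0.1, and +1.3 percentage points respectively, so the two SWE Verified results sit within 0.3 points of the published scores, while the AIME25 reproduction is 1.3 points above the published one.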
