BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning

arXiv:2603.04124v14.41 citations, h-index: 117

Originality: Incremental advance

AI Analysis

This work is significant for researchers in AI and physics, as it explores the limitations of outcome-level alignment in teaching robust scientific reasoning to language models, suggesting that even precise, analytically exact rewards do not guarantee transferable physical understanding.

This paper investigates whether reinforcement learning with verifiable rewards can teach a compact 1.5B-parameter language model to reason about physics, specifically beam statics. The best BeamPERL model achieved a 66.7% improvement in Pass@1 over the base model, but its competence was anisotropic, generalizing compositionally but failing under topological shifts.

Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers, without teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning, whereas continued optimization degrades robustness even as reward is maintained. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of governing equations. The precision of the reward signal, even when analytically exact, does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
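To make the idea of a binary, verifiable reward concrete, here is a minimal sketch for one kind of beam statics problem: a beam on two supports carrying point loads. It is an illustration under stated assumptions, not the paper's implementation. The abstract says the rewards come from symbolic solvers; this sketch substitutes closed-form equilibrium arithmetic, and the function names (`support_reactions`, `binary_reward`), the load convention, and the tolerance are all hypothetical.

```python
import math


def support_reactions(support_a, support_b, loads):
    """Reference reactions for a beam on two supports (illustrative only).

    loads is a list of (position, magnitude) pairs; downward loads are
    positive and reactions are reported as positive upward (an assumed
    sign convention).
    """
    span = support_b - support_a
    if span <= 0:
        raise ValueError("support_b must lie to the right of support_a")
    total_load = sum(magnitude for _, magnitude in loads)
    # Moment equilibrium about support A gives the reaction at B.
    moment_about_a = sum(
        magnitude * (position - support_a) for position, magnitude in loads
    )
    reaction_b = moment_about_a / span
    # Vertical force equilibrium gives the reaction at A.
    reaction_a = total_load - reaction_b
    return reaction_a, reaction_b


def binary_reward(predicted, reference, rel_tol=1e-3):
    """Return 1.0 only if every predicted value matches the reference."""
    if len(predicted) != len(reference):
        return 0.0
    matches = all(
        math.isclose(p, r, rel_tol=rel_tol, abs_tol=1e-9)
        for p, r in zip(predicted, reference)
    )
    return 1.0 if matches else 0.0


# Supports at the ends of a 10 m beam, one 100 N load at x = 4 m.
end_supported = support_reactions(0.0, 10.0, [(4.0, 100.0)])

# Same load, but the left support is moved to x = 2 m (a topological shift).
moved_support = support_reactions(2.0, 10.0, [(4.0, 100.0)])
```

Both calls use the same two equilibrium equations; only the support positions differ. A policy that memorized the end-supported answer pattern (60 N, 40 N) would score `binary_reward(...) == 0.0` against the moved-support reference (75 N, 25 N), which mirrors the kind of failure the abstract describes.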

