Md Jobayer Rahman Rafy

14.7 · CL · Jul 31, 2025 · Code

PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy et al.

The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems, a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

AI · Jun 5 · Code

Statistically Grounded Sparse-Feature Interventions for Activation-Space Control in Large Language Models

Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy et al.

Activation steering offers a lightweight alternative to fine-tuning for behavioral control of large language models, but SAE-based steering methods often rely on learned steering objectives or single-criterion feature selection. We introduce a transparent SAE-feature steering pipeline that first applies a six-condition reliability filter, then ranks sparse features through an unweighted Borda consensus over three complementary statistics: $F$-test, KSG mutual information, and Cohen's $d$. The resulting steering direction is constructed as a Cohen's-$d$-weighted combination of SAE decoder rows, providing an optimization-free direction motivated by Fisher-LDA under approximate SAE-feature decorrelation. Across three Gemma-family models, four behavioral domains, and 356 layer-strength configurations, the method produces measurable domain-specific shifts while revealing a substantial gap between raw attribute movement and quality-preserving generation. In the strongest configuration, logical-correctness steering reaches a primary-score delta of $+1.16$ in Gemma 2 9B; however, our broader finding is that usable steering is highly localized by model, domain, layer, and strength. These results argue that activation-steering evaluations should report quality-conditioned success alongside raw behavioral shift. Our code and data are available at https://github.com/Oshayer-Siddique/LLM-Steering-Using-SAE.
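The ranking and direction-construction steps named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `cohens_d`, `borda_rank`, and `steering_direction` are hypothetical, the $F$-test and KSG mutual-information scores are assumed to be precomputed and passed in as plain arrays, the six-condition reliability filter is omitted, and the unit-length normalization of the direction is an added assumption.

```python
import numpy as np

def cohens_d(pos, neg):
    """Per-feature Cohen's d between two activation matrices (samples x features)."""
    n1, n2 = len(pos), len(neg)
    # Pooled variance across the two groups (unbiased estimates, ddof=1).
    pooled_var = ((n1 - 1) * pos.var(axis=0, ddof=1)
                  + (n2 - 1) * neg.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    # Small epsilon guards against division by zero (an assumption of this sketch).
    return (pos.mean(axis=0) - neg.mean(axis=0)) / np.sqrt(pooled_var + 1e-12)

def borda_rank(*scores):
    """Unweighted Borda count: each criterion awards 0..n-1 points by score rank."""
    total = np.zeros(len(scores[0]))
    for s in scores:
        # argsort of argsort yields each feature's rank (0 = lowest score).
        total += np.argsort(np.argsort(np.asarray(s)))
    return total

def steering_direction(decoder, d, selected):
    """Cohen's-d-weighted sum of the selected decoder rows, scaled to unit norm."""
    v = (d[selected, None] * decoder[selected]).sum(axis=0)
    return v / np.linalg.norm(v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(0.5, 1.0, size=(200, 8))   # hypothetical SAE activations, target class
    neg = rng.normal(0.0, 1.0, size=(200, 8))   # hypothetical SAE activations, contrast class
    decoder = rng.normal(size=(8, 16))          # hypothetical SAE decoder (features x d_model)
    d = cohens_d(pos, neg)
    f_scores = rng.random(8)                    # placeholder for F-test statistics
    mi_scores = rng.random(8)                   # placeholder for KSG mutual information
    points = borda_rank(f_scores, mi_scores, np.abs(d))
    top = np.argsort(points)[-3:]               # keep the three highest-consensus features
    print(steering_direction(decoder, d, top).shape)
```

Ranking on `np.abs(d)` while weighting by the signed `d` is one possible design choice in this sketch: the consensus step then selects features by effect magnitude, and the sign determines whether a decoder row pushes toward or away from the target behavior.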

2 Papers