Umair Z. Ahmed

h-index: 7

Papers: 6

Citations: 165

Novelty: 40%

AI Score: 39

Ranked #77,440 of 194,257 authors (top 40%); #794 in SE (top 26%)

6 Papers

17.4 · AI · Oct 13, 2025

Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai et al.

New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether they should switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate 96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate <25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are not good enough. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, achieving a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
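
The contrast between majority voting and a minority veto can be illustrated with a minimal sketch. The abstract does not specify how the veto threshold is chosen or how votes are encoded, so the names `majority_vote`, `minority_veto`, and `veto_threshold`, and the use of `None` for a missing vote, are assumptions made here for illustration only, not the authors' implementation.

```python
from typing import Optional, Sequence


def majority_vote(votes: Sequence[Optional[bool]]) -> bool:
    """Accept an output when a strict majority of the cast votes say 'valid'."""
    cast = [v for v in votes if v is not None]  # drop missing votes
    if not cast:
        raise ValueError("no votes were cast")
    return sum(cast) * 2 > len(cast)


def minority_veto(votes: Sequence[Optional[bool]], veto_threshold: int = 2) -> bool:
    """Accept an output unless at least `veto_threshold` judges say 'invalid'.

    Missing votes (None) are ignored, so absent judges never count as vetoes.
    `veto_threshold` is a hypothetical parameter for this sketch.
    """
    invalid_votes = sum(1 for v in votes if v is False)
    return invalid_votes < veto_threshold


# Five judges responded, one did not; two of the five flag the output.
votes = [True, True, True, False, False, None]
print(majority_vote(votes))      # True: 3 of 5 cast votes say valid
print(minority_veto(votes, 2))   # False: two 'invalid' votes trigger the veto
```

Because the judges rarely say "invalid" (True Negative Rate <25% in the abstract), a small number of negative votes carries more information than the many positive ones, which is the intuition behind letting a minority overrule the majority.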

12.0 · SE · Jun 30, 2021

Verifix: Verified Repair of Programming Assignments

Umair Z. Ahmed, Zhiyu Fan, Jooyong Yi et al.

Automated feedback generation for introductory programming assignments is useful for programming education. Most works try to generate feedback to correct a student program by comparing its behavior with an instructor's reference program on selected tests. In this work, our aim is to generate verifiably correct program repairs as student feedback. The student assignment is aligned and composed with a reference solution in terms of control flow, and differences in data variables are automatically summarized via predicates to relate the variable names. Failed verification attempts for the equivalence of the two programs are exploited to obtain a collection of maxSMT queries, whose solutions point to repairs of the student assignment. We have conducted experiments on student assignments curated from a widely deployed intelligent tutoring system. Our results indicate that we can generate verified feedback in up to 58% of the assignments. More importantly, our system indicates when it is able to generate verified feedback, which is then usable by novice students with high confidence.

12.6 · CY · Jun 17, 2020 · Code

Synthesizing Tasks for Block-based Programming

Umair Z. Ahmed, Maria Christakis, Aleksandr Efremov et al.

Block-based visual programming environments play a critical role in introducing computing concepts to K-12 students. One of the key pedagogical challenges in these environments is in designing new practice tasks for a student that match a desired level of difficulty and exercise specific programming concepts. In this paper, we formalize the problem of synthesizing visual programming tasks. In particular, given a reference visual task $\rm T^{in}$ and its solution code $\rm C^{in}$, we propose a novel methodology to automatically generate a set $\{(\rm T^{out}, \rm C^{out})\}$ of new tasks along with solution codes such that tasks $\rm T^{in}$ and $\rm T^{out}$ are conceptually similar but visually dissimilar. Our methodology is based on the realization that the mapping from the space of visual tasks to their solution codes is highly discontinuous; hence, directly mutating reference task $\rm T^{in}$ to generate new tasks is futile. Our task synthesis algorithm operates by first mutating code $\rm C^{in}$ to obtain a set of codes $\{\rm C^{out}\}$. Then, the algorithm performs symbolic execution over a code $\rm C^{out}$ to obtain a visual task $\rm T^{out}$; this step uses the Monte Carlo Tree Search (MCTS) procedure to guide the search in the symbolic tree. We demonstrate the effectiveness of our algorithm through an extensive empirical evaluation and user study on reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com.

7.3 · SE · May 28, 2020 · Code

MACER: A Modular Framework for Accelerated Compilation Error Repair

Darshak Chhatbar, Umair Z. Ahmed, Purushottam Kar

Automated compilation error repair, the problem of suggesting fixes to buggy programs that fail to compile, has generated significant interest in recent years. Apart from being a tool of general convenience, automated code repair has significant pedagogical applications for novice programmers who find compiler error messages cryptic and unhelpful. Existing approaches largely solve this problem using a black-box application of a heavy-duty generative learning technique, such as sequence-to-sequence prediction (TRACER) or reinforcement learning (RLAssist). Although convenient, such black-box application of learning techniques makes existing approaches bulky in terms of training time, as well as inefficient at targeting specific error types. We present MACER, a novel technique for accelerated error repair based on a modular segregation of the repair process into repair identification and repair application. MACER uses powerful yet inexpensive discriminative learning techniques such as multi-label classifiers and rankers to first identify the type of repair required and then apply the suggested repair. Experiments indicate that the fine-grained approach adopted by MACER offers not only superior error correction, but also much faster training and prediction. On a benchmark dataset of 4K buggy programs collected from actual student submissions, MACER outperforms existing methods by 20% at suggesting fixes for popular errors that exactly match the fix desired by the student. MACER is also competitive with or better than existing methods at all error types, whether popular or rare. MACER offers a training time speedup of 2x over TRACER and 800x over RLAssist, and a test time speedup of 2-4x over both.

9.9 · SE · Sep 2, 2019 · Code

Targeted Example Generation for Compilation Errors

Umair Z. Ahmed, Renuka Sindhgatta, Nisheeth Srivastava et al.

We present TEGCER, an automated feedback tool for novice programmers. TEGCER uses supervised classification to match compilation errors in new code submissions with relevant pre-existing errors previously submitted by other students. The dense neural network used to perform this classification task is trained on 15000+ error-repair code examples. The proposed model yields a test set classification Pred@3 accuracy of 97.7% across 212 error category labels. Using this model as its base, TEGCER presents students, on demand, with the closest relevant examples of solutions for their specific error.
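
The abstract does not define Pred@3, so the sketch below assumes the common top-k reading: a prediction counts as correct when the true error category is among the k highest-scoring labels. The function name `pred_at_k` and the toy scores over four labels are hypothetical and are not taken from TEGCER.

```python
from typing import Sequence


def pred_at_k(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 3) -> float:
    """Fraction of examples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy classifier outputs over 4 hypothetical error categories.
scores = [
    [0.10, 0.50, 0.30, 0.05],  # true label 2 is ranked second: hit
    [0.60, 0.20, 0.15, 0.05],  # true label 3 is ranked last: miss
    [0.05, 0.10, 0.15, 0.70],  # true label 1 is ranked third: hit
]
labels = [2, 3, 1]
print(pred_at_k(scores, labels, k=3))  # 2 of 3 examples are hits
```

Under this reading, a Pred@3 figure is natural for a tool that shows students several candidate examples at once, since any of the top three matches can be useful.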

5.9 · CY · Aug 12, 2016

Prutor: A System for Tutoring CS1 and Collecting Student Programs for Analysis

Rajdeep Das, Umair Z. Ahmed, Amey Karkare et al.

An introductory programming course (CS1) is an integral part of any undergraduate curriculum. Due to the large number and diverse programming backgrounds of students, providing timely and personalised feedback to individual students is a challenging task for any CS1 instructor. The help provided by teaching assistants (typically senior students) is not sufficient, as it suffers from unintentional bias and is, most of the time, not quick enough. In this paper, we present Prutor, a tutoring system platform to conduct introductory programming courses. Prutor is a cloud-based web application that provides instant and useful feedback to students while solving programming problems. Prutor stores, at regular intervals, snapshots of students' attempts to solve programming problems. These intermediate versions of the student programs provide the instructors (and data analysts) a view of the students' approach to solving programming problems. Since Prutor is accessible through any standard web browser, students do not need to worry about dependencies external to the programming course, viz. Operating Systems, Editors, Compilers, Compiler Options, etc. This enables the students to focus on solving only the programming problems. Apart from the code snapshots at regular intervals, Prutor also collects other valuable data such as the time taken by the students to solve the problems, the number of compile and execution events, and the errors made. We have used this data in developing intelligent tools for giving feedback to students, some of which are described briefly in this paper. This system thus serves as a platform for tutoring as well as data collection for researchers.