Sarah Zhang

CL
h-index16
6papers
559citations
Novelty48%
AI Score41

6 Papers

LGJun 11, 2022
From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams

Iddo Drori, Sarah J. Zhang, Reece Shuttleworth et al. · harvard

A final exam in machine learning at a top institution such as MIT, Harvard, or Cornell typically takes faculty days to write, and students hours to solve. We demonstrate that large language models pass machine learning finals at a human level, on finals available online after the models were trained, and automatically generate new human-quality final exam questions in seconds. Previous work has developed program synthesis and few-shot learning methods to solve university-level problem set questions in mathematics and STEM courses. In this work, we develop and compare methods that solve final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We curate a dataset and benchmark of questions from machine learning final exams available online and code for answering these questions and generating new questions. We show how to generate new questions from other questions and course notes. For reproducibility and future research on this final exam benchmark, we use automatic checkers for multiple-choice, numeric, and questions with expression answers. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting using GPT-3, OPT, Codex, and ChatGPT across machine learning topics and find that few-shot learning methods perform best. We highlight the transformative potential of language models to streamline the writing and solution of large-scale assessments, significantly reducing the workload from human days to mere machine seconds. Our results suggest that rather than banning large language models such as ChatGPT in class, instructors should teach students to harness them by asking students meta-questions about correctness, completeness, and originality of the responses generated, encouraging critical thinking in academic studies.

CLNov 12, 2025
Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

Jiarui Liu, Kaustubh Dhole, Yingheng Wang et al.

Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.

CLMay 6, 2025
A Reasoning-Focused Legal Retrieval Benchmark

Lucia Zheng, Neel Guha, Javokhir Arifov et al.

As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

AINov 17, 2025
WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance

Genglin Liu, Shijie Geng, Sha Li et al.

Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.

LGDec 31, 2021
A Neural Network Solves, Explains, and Generates University Math Problems by Program Synthesis and Few-Shot Learning at Human Level

Iddo Drori, Sarah Zhang, Reece Shuttleworth et al.

We demonstrate that a neural network pre-trained on text and fine-tuned on code solves mathematics course problems, explains solutions, and generates new questions at a human level. We automatically synthesize programs using few-shot learning and OpenAI's Codex transformer and execute them to solve course problems at 81% automatic accuracy. We curate a new dataset of questions from MIT's largest mathematics courses (Single Variable and Multivariable Calculus, Differential Equations, Introduction to Probability and Statistics, Linear Algebra, and Mathematics for Computer Science) and Columbia University's Computational Linear Algebra. We solve questions from a MATH dataset (on Prealgebra, Algebra, Counting and Probability, Intermediate Algebra, Number Theory, and Precalculus), the latest benchmark of advanced mathematics problems designed to assess mathematical reasoning. We randomly sample questions and generate solutions with multiple modalities, including numbers, equations, and plots. The latest GPT-3 language model pre-trained on text automatically solves only 18.8% of these university questions using zero-shot learning and 30.8% using few-shot learning and the most recent chain of thought prompting. In contrast, program synthesis with few-shot learning using Codex fine-tuned on code generates programs that automatically solve 81% of these questions. Our approach improves the previous state-of-the-art automatic solution accuracy on the benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate the quality and difficulty of generated questions. This work is the first to automatically solve university-level mathematics course questions at a human level and the first work to explain and generate university-level mathematics course questions at scale, a milestone for higher education.

IVAug 20, 2020
Deep learning-based transformation of the H&E stain into special stains

Kevin de Haan, Yijie Zhang, Jonathan E. Zuckerman et al.

Pathology is practiced by visual inspection of histochemically stained slides. Most commonly, the hematoxylin and eosin (H&E) stain is used in the diagnostic workflow and it is the gold standard for cancer diagnosis. However, in many cases, especially for non-neoplastic diseases, additional "special stains" are used to provide different levels of contrast and color to tissue components and allow pathologists to get a clearer diagnostic picture. In this study, we demonstrate the utility of supervised learning-based computational stain transformation from H&E to different special stains (Masson's Trichrome, periodic acid-Schiff and Jones silver stain) using tissue sections from kidney needle core biopsies. Based on evaluation by three renal pathologists, followed by adjudication by a fourth renal pathologist, we show that the generation of virtual special stains from existing H&E images improves the diagnosis in several non-neoplastic kidney diseases sampled from 58 unique subjects. A second study performed by three pathologists found that the quality of the special stains generated by the stain transformation network was statistically equivalent to those generated through standard histochemical staining. As the transformation of H&E images into special stains can be achieved within 1 min or less per patient core specimen slide, this stain-to-stain transformation framework can improve the quality of the preliminary diagnosis when additional special stains are needed, along with significant savings in time and cost, reducing the burden on healthcare system and patients.