Tianmai M. Zhang

CL
h-index23
4papers
70citations
Novelty29%
AI Score31

4 Papers

CLAug 19, 2024
BLADE: Benchmarking Language Model Agents for Data-Driven Science

Ken Gu, Ruoxi Shang, Ruien Jiang et al. · uw

Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.

CLDec 4, 2025
UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction

Tianmai M. Zhang, Zhaoyi Sun, Sihang Zeng et al.

The ChemoTimelines shared task benchmarks methods for constructing timelines of systemic anticancer treatment from electronic health records of cancer patients. This paper describes our methods, results, and findings for subtask 2 -- generating patient chemotherapy timelines from raw clinical notes. We evaluated strategies involving chain-of-thought thinking, supervised fine-tuning, direct preference optimization, and dictionary-based lookup to improve timeline extraction. All of our approaches followed a two-step workflow, wherein an LLM first extracted chemotherapy events from individual clinical notes, and then an algorithm normalized and aggregated events into patient-level timelines. Each specific method differed in how the associated LLM was utilized and trained. Multiple approaches yielded competitive performances on the test set leaderboard, with fine-tuned Qwen3-14B achieving the best official score of 0.678. Our results and analyses could provide useful insights for future attempts on this task as well as the design of similar tasks.

CLMay 28, 2025
Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation

Tianmai M. Zhang, Neil F. Abernethy

Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from multiple vendors and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.

CLNov 9, 2024
Detecting Reference Errors in Scientific Literature with Large Language Models

Tianmai M. Zhang, Neil F. Abernethy

Reference errors, such as citation and quotation errors, are common in scientific papers. Such errors can result in the propagation of inaccurate information, but are difficult and time-consuming to detect, posing a significant challenge to scientific publishing. To support automatic detection of reference errors, this work evaluated the ability of large language models in OpenAI's GPT family to detect quotation errors. Specifically, we prepared an expert-annotated, general-domain dataset of statement-reference pairs from journal articles. Large language models were evaluated in different settings with varying amounts of reference information provided by retrieval augmentation. Our results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning. This study contributes to the growing literature that seeks to utilize artificial intelligence to assist in the writing, reviewing, and publishing of scientific papers. Potential avenues for further improvements in this task are also discussed.