William E. Walden

h-index: 28

3 papers

5 citations

Novelty: 35%

AI Score: 36

Ranked #97,673 of 194,257 authors (top 50%); #5,981 in AI (top 48%)

3 Papers

7.5 · AI · Jan 12

Reasoning Models Will Blatantly Lie About Their Reasoning

William Walden

It has been shown that Large Reasoning Models (LRMs) may not *say what they think*: they do not always volunteer information about how certain parts of the input influence their reasoning. But it is one thing for a model to *omit* such information and another, worse thing to *lie* about it. Here, we extend the work of Chen et al. (2025) to show that LRMs will do just this: they will flatly deny relying on hints provided in the prompt in answering multiple-choice questions -- even when directly asked to reflect on unusual (i.e., hinted) prompt content, even when allowed to use hints, and even though experiments *show* them to be using the hints. Our results thus have discouraging implications for CoT monitoring and interpretability.

7.3 · IR · Jan 19

Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?

Laura Dietz, Bryan Li, Eugene Yang et al.

RAG systems are increasingly evaluated and optimized using LLM judges, an approach that is rapidly becoming the dominant paradigm for system assessment. Nugget-based approaches in particular are now embedded not only in evaluation frameworks but also in the architectures of RAG systems themselves. While this integration can lead to genuine improvements, it also creates a risk of faulty measurements due to circularity. In this paper, we investigate this risk through comparative experiments with nugget-based RAG systems, including Ginger and Crucible, against strong baselines such as GPT-Researcher. By deliberately modifying Crucible to generate outputs optimized for an LLM judge, we show that near-perfect evaluation scores can be achieved when elements of the evaluation - such as prompt templates or gold nuggets - are leaked or can be predicted. Our results highlight the importance of blind evaluation settings and methodological diversity to guard against mistaking metric overfitting for genuine system progress.

1.9 · CL · Oct 18, 2024 · Code

Cross-Document Event-Keyed Summarization

William Walden, Pavlo Kuchmiichuk, Alexander Martin et al.

Event-keyed summarization (EKS) requires summarizing a specific event described in a document given the document text and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event as given by multiple sources. We introduce SEAMUS (Summaries of Events Across Multiple Sources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMUS dataset for cross-document argument extraction. We present a suite of baselines on SEAMUS -- covering both smaller, fine-tuned models and zero- and few-shot prompted LLMs -- along with detailed ablations and a human evaluation study, showing SEAMUS to be a valuable benchmark for this new task.