Michael S. Lee

CL
h-index26
8papers
34citations
Novelty48%
AI Score47

8 Papers

ROMar 3, 2022
Reasoning about Counterfactuals to Improve Human Inverse Reinforcement Learning

Michael S. Lee, Henny Admoni, Reid Simmons

To collaborate well with robots, we must be able to understand their decision making. Humans naturally infer other agents' beliefs and desires by reasoning about their observable behavior in a way that resembles inverse reinforcement learning (IRL). Thus, robots can convey their beliefs and desires by providing demonstrations that are informative for a human learner's IRL. An informative demonstration is one that differs strongly from the learner's expectations of what the robot will do given their current understanding of the robot's decision making. However, standard IRL does not model the learner's existing expectations, and thus cannot do this counterfactual reasoning. We propose to incorporate the learner's current understanding of the robot's decision making into our model of human IRL, so that a robot can select demonstrations that maximize the human's understanding. We also propose a novel measure for estimating the difficulty for a human to predict instances of a robot's behavior in unseen environments. A user study finds that our test difficulty measure correlates well with human performance and confidence. Interestingly, considering human beliefs and counterfactuals when selecting demonstrations decreases human performance on easy tests, but increases performance on difficult tests, providing insight on how to best utilize such models.

CLMay 13Code
ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

Michael S. Lee, Yash Maurya, Drew Rein et al.

Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce \emph{ROK-FORTRESS} https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public, a bilingual, culturally adversarial NSPS benchmark that uses the English--Korean language pair and U.S.--ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a \emph{transcreation matrix}: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S.\ versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression -- with no model showing significant amplification in the other direction -- indicating that, at least in the English--Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language--culture pairs.

AIJul 13, 2023
Leveraging Contextual Counterfactuals Toward Belief Calibration

Qiuyi, Zhang, Michael S. Lee et al.

Beliefs and values are increasingly being incorporated into our AI systems through alignment processes, such as carefully curating data collection principles or regularizing the loss function used for training. However, the meta-alignment problem is that these human beliefs are diverse and not aligned across populations; furthermore, the implicit strength of each belief may not be well calibrated even among humans, especially when trying to generalize across contexts. Specifically, in high regret situations, we observe that contextual counterfactuals and recourse costs are particularly important in updating a decision maker's beliefs and the strengths to which such beliefs are held. Therefore, we argue that including counterfactuals is key to an accurate calibration of beliefs during alignment. To do this, we first segment belief diversity into two categories: subjectivity (across individuals within a population) and epistemic uncertainty (within an individual across different contexts). By leveraging our notion of epistemic uncertainty, we introduce `the belief calibration cycle' framework to more holistically calibrate this diversity of beliefs with context-driven counterfactual reasoning by using a multi-objective optimization. We empirically apply our framework for finding a Pareto frontier of clustered optimal belief strengths that generalize across different contexts, demonstrating its efficacy on a toy dataset for credit decisions.

CLFeb 11
LHAW: Controllable Underspecification for Long-Horizon Tasks

George Pu, Michael S. Lee, Udari Madhushani Sehwag et al.

Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.

CLOct 18, 2025
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Yu Ying Chiu, Michael S. Lee, Rachel Calcott et al. · uw

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To do so, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria including identifying moral considerations, weighing trade-offs, and giving actionable recommendations to cover cases on AI advising humans moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

CYApr 1, 2024
Closed-loop Teaching via Demonstrations to Improve Policy Transparency

Michael S. Lee, Reid Simmons, Henny Admoni

Demonstrations are a powerful way of increasing the transparency of AI policies. Though informative demonstrations may be selected a priori through the machine teaching paradigm, student learning may deviate from the preselected curriculum in situ. This paper thus explores augmenting a curriculum with a closed-loop teaching framework inspired by principles from the education literature, such as the zone of proximal development and the testing effect. We utilize tests accordingly to close to the loop and maintain a novel particle filter model of human beliefs throughout the learning process, allowing us to provide demonstrations that are targeted to the human's current understanding in real time. A user study finds that our proposed closed-loop teaching framework reduces the regret in human test responses by 43% over a baseline.

LGOct 4, 2019
Requirements for Developing Robust Neural Networks

John S. Hyatt, Michael S. Lee

Validation accuracy is a necessary, but not sufficient, measure of a neural network classifier's quality. High validation accuracy during development does not guarantee that a model is free of serious flaws, such as vulnerability to adversarial attacks or a tendency to misclassify (with high confidence) data it was not trained on. The model may also be incomprehensible to a human or base its decisions on unreasonable criteria. These problems, which are not unique to classifiers, have been the focus of a substantial amount of recent research. However, they are not prioritized during model development, which almost always optimizes on validation accuracy to the exclusion of everything else. The product of this approach is likely to fail in unexpected ways outside of the training environment. We believe that, in addition to validation accuracy, the model development process must give added weight to other performance metrics such as explainability, resistance to adversarial attacks, and overconfidence on out-of-distribution data.

ROFeb 15, 2018
3-D Volumetric Gamma-ray Imaging and Source Localization with a Mobile Robot

Michael S. Lee, Matthew Hanczor, Jiyang Chu et al.

Radiation detection has largely been a manual inspection process with point sensors such as Geiger-Muller counters and scintillation spectrometers to date. While their observations of source proximity prove useful, they lack the directional information necessary for efficient source localization and characterization in cluttered environments with multiple radiation sources. The recent commercialization of Compton gamma cameras provides directional information to the broader radiation detection community for the first time. This paper presents the integration of a Compton gamma camera with a self-localizing ground robot for accurate 3D radiation mapping. Using the position and orientation of the robot, radiation images from the gamma camera are accumulated over a traversed path in a shared frame of reference to construct a consistent voxel grid-based radiation map. The peaks of the map at pre-specified energy windows are selected as the source location estimates, which are compared to the ground truth source locations. The proposed approach localizes multiple sources to within an average of 0.2 m in two 5 x 4 m^2 and 14 x 6 m^2 laboratory environments.