AIFeb 25
A Framework for Assessing AI Agent Decisions and Outcomes in AutoML PipelinesGaoyuan Du, Amit Ahlawat, Xiaoyang Liu et al.
Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision-level evaluation metrics intended for post-hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision-centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof-of-concept experiments, we demonstrate that the EA can (i) detect faulty decisions with an F1 score of 0.919, (ii) identify reasoning inconsistencies independent of final outcomes, and (iii) attribute downstream performance changes to agent decisions, revealing impacts ranging from -4.9\% to +8.3\% in final metrics. These results illustrate how decision-centric evaluation exposes failure modes that are invisible to outcome-only metrics. Our work reframes the evaluation of agentic AutoML systems from an outcome-based perspective to one that audits agent decisions, offering a foundation for reliable, interpretable, and governable autonomous ML systems.
CROct 28, 2014
A Systematic Security Evaluation of Android's Multi-User FrameworkPaul Ratazzi, Yousra Aafer, Amit Ahlawat et al.
Like many desktop operating systems in the 1990s, Android is now in the process of including support for multi-user scenarios. Because these scenarios introduce new threats to the system, we should have an understanding of how well the system design addresses them. Since the security implications of multi-user support are truly pervasive, we developed a systematic approach to studying the system and identifying problems. Unlike other approaches that focus on specific attacks or threat models, ours systematically identifies critical places where access controls are not present or do not properly identify the subject and object of a decision. Finding these places gives us insight into hypothetical attacks that could result, and allows us to design specific experiments to test our hypothesis. Following an overview of the new features and their implementation, we describe our methodology, present a partial list of our most interesting hypotheses, and describe the experiments we used to test them. Our findings indicate that the current system only partially addresses the new threats, leaving the door open to a number of significant vulnerabilities and privacy issues. Our findings span a spectrum of root causes, from simple oversights, all the way to major system design problems. We conclude that there is still a long way to go before the system can be used in anything more than the most casual of sharing environments.