CLJul 12, 2024
ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe PromptsAmelia F. Hardy, Houjun Liu, Allie Griffith et al.
Existing LLM red-teaming approaches prioritize high attack success rate, often resulting in high-perplexity prompts. This focus overlooks low-perplexity attacks that are more difficult to filter, more likely to arise during benign usage, and more impactful as negative downstream training examples. In response, we introduce ASTPrompter, a single-step optimization method that uses contrastive preference learning to train an attacker to maintain low perplexity while achieving a high attack success rate (ASR). ASTPrompter achieves an attack success rate 5.1 times higher on Llama-8.1B while using inputs that are 2.1 times more likely to occur according to the frozen LLM. Furthermore, our attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and white-box settings. Lastly, by tuning a single hyperparameter in our method, we discover successful attack prefixes along an efficient frontier between ASR and perplexity, highlighting perplexity as a previously under-considered factor in red-teaming.
CYFeb 6
The Doctor Will (Still) See You Now: On the Structural Limits of Agentic AI in HealthcareGabriela Aránguiz Dias, Kiana Jafari, Allie Griffith et al.
Across healthcare, agentic artificial intelligence (AI) systems are increasingly promoted as capable of autonomous action, yet in practice they currently operate under near-total human oversight due to safety, regulatory, and liability constraints that make autonomous clinical reasoning infeasible in high-stakes environments. While market enthusiasm suggests a revolution in healthcare agents, the conceptual assumptions and accountability structures shaping these systems remain underexamined. We present a qualitative study based on interviews with 20 stakeholders, including developers, implementers, and end users. Our analysis identifies three mutually reinforcing tensions: conceptual fragmentation regarding the definition of `agentic'; an autonomy contradiction where commercial promises exceed operational reality; and an evaluation blind spot that prioritizes technical benchmarks over sociotechnical safety. We argue that agentic {AI} functions as a site of contested meaning-making where technical aspirations, commercial incentives, and clinical constraints intersect, carrying material consequences for patient safety and the distribution of blame.
AIDec 7, 2024
More than Marketing? On the Information Value of AI Benchmarks for PractitionersAmelia Hardy, Anka Reuel, Kiana Jafari Meimandi et al.
Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.