Ivan Flechais

9papers

12citations

Novelty30%

AI Score46

Ranked #63,111 of 201,326 authors (top 31%)#404 in HC (top 14%)

9 Papers

58.3HCApr 20

The Collaboration Gap in Human-AI Work

Varad Vishwarupe, Marina Jirotka, Nigel Shadbolt et al. · mit, oxford

LLMs are increasingly presented as collaborators in programming, design, writing, and analysis. Yet the practical experience of working with them often falls short of this promise. In many settings, users must diagnose misunderstandings, reconstruct missing assumptions, and repeatedly repair misaligned responses. This poster introduces a conceptual framework for understanding why such collaboration remains fragile. Drawing on a constructivist grounded theory analysis of 16 interviews with designers, developers, and applied AI practitioners working on LLM-enabled systems, and informed by literature on human-AI collaboration, we argue that stable collaboration depends not only on model capability but on the interaction's grounding conditions. We distinguish three recurrent structures of human-AI work: one-shot assistance, weak collaboration with asymmetric repair, and grounded collaboration. We propose that collaboration breaks down when the appearance of partnership outpaces the grounding capacity of the interaction and contribute a framework for discussing grounding, repair, and interaction structure in LLM-enabled work.

58.2AIMay 6

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka et al. · mit, oxford

Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.

28.2HCMar 15

To LLM, or Not to LLM: How Designers and Developers Navigate LLMs as Tools or Teammates

Varad Vishwarupe, Ivan Flechais, Nigel Shadbolt et al. · mit, oxford

Large language models (LLMs) are increasingly integrated into design and development workflows, yet decisions about their use are rarely binary or purely technical. We report findings from a constructivist grounded theory study based on interviews with 33 designers and developers across three large technology organisations. Rather than evaluating LLMs solely by capability, participants reasoned about the role an LLM could occupy within a workflow and how that role would interact with existing structures of responsibility and organisational accountability. When LLMs were framed as tools under clear human control, their use was typically acceptable and could be integrated within existing governance structures. When framed as teammates with shared or ambiguous agency, practitioners expressed hesitation, particularly when responsibility for outcomes could not be clearly justified. At the same time, participants also described productive teammate configurations in which LLMs supported collaborative reasoning while remaining embedded within explicit oversight structures. We identify tool and teammate framings as recurring ways in which designers and developers position LLMs relative to human work and present an analytic rubric describing how role framing shapes decision authority, accountability ownership, oversight strategies, and organisational acceptability. By foregrounding design-time reasoning, this work reframes To LLM or Not to LLM as a sociotechnical positioning problem that emerges during system design rather than during post-deployment evaluation.

14.4AIMay 12

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka et al.

Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.

86.7CYMay 5

NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka et al.

Frontier AI safety claims - published assertions that a highly capable general-purpose model is below a threshold of concern, adequately mitigated, or suitable for release - increasingly shape model deployment, governance, and public trust. Yet the artefacts needed to evaluate them are routinely withheld, producing an evidential inversion: the most consequential claims in AI safety are often the least reproducible. This position paper argues that NeurIPS should require reproducibility standards for papers making such claims, treating non-reproducibility not as a transparency preference but as an evaluation-methodology failure. The 2026 International AI Safety Report [Bengio et al., 2026] concludes that reliable pre-deployment safety testing has become harder to conduct and that models now distinguish test from deployment contexts; the 2025 Foundation Model Transparency Index [Wan et al., 2025] reports a sector-average transparency score of 40/100 with no major developer adequately disclosing train-test overlap; contemporaneous measurement-theory work shows that attack-success-rate comparisons across systems are often founded on low-validity measurements [Chouldechova et al., 2025]. We propose a three-tier disclosure framework, distinguishing public, controlled, and claim-restricted disclosure, paired with a mandatory claim inventory, scope statements, and a phased implementation path with graduated sanctions. The framework treats secrecy and openness as endpoints of a spectrum, with controlled review (via a federated colloquium of qualified secure-review hosts) covering claims whose artefacts cannot be released publicly, and right-scaling claims whose artefacts cannot be reviewed even confidentially. The standard the community applies to its most consequential claims should be at least as high as the standard it applies to its least.

86.5HCApr 26

From Rights to Rites: Expectations Management in Smart-Home AI

Varad Vishwarupe, Ivan Flechais, Marina Jirotka et al.

Domestic voice assistants and smart-home devices are increasingly embedded in everyday routines, yet their ethics are often treated as an afterthought or delegated to compliance teams. To explore how expectations about smart-home AI are constructed and managed, we conducted 33 semi-structured interviews with designers, developers, and researchers from major smart-home platforms (Amazon Alexa, Microsoft Azure IoT, and Google Nest). Using a constructivist grounded theory approach, we develop Expectations Management (EM): a culturally embedded model describing how practitioners shape, calibrate, and repair expectations by balancing organisational rights with culturally situated rites. We show that EM differs from expectation-confirmation theory and trust-calibration by foregrounding moral judgement, situated action, and cross-cultural variation. Our analysis reveals four recurring design tensions: automation vs. autonomy, helpfulness vs. intrusiveness, personalisation vs. predictability, and transparency vs. obscurity and distils them into a five-phase EM Design Playbook that supports moral prudence. We discuss implications for responsible smart-home design and offer guidance for human-centred AI.

CRAug 11, 2020

Security should be there by default: Investigating how journalists perceive and respond to risks from the Internet of Things

Anjuli R. K. Shere, Jason R. C. Nurse, Ivan Flechais

Journalists have long been the targets of both physical and cyber-attacks from well-resourced adversaries. Internet of Things (IoT) devices are arguably a new avenue of threat towards journalists through both targeted and generalised cyber-physical exploitation. This study comprises three parts: First, we interviewed 11 journalists and surveyed 5 further journalists, to determine the extent to which journalists perceive threats through the IoT, particularly via consumer IoT devices. Second, we surveyed 34 cyber security experts to establish if and how lay-people can combat IoT threats. Third, we compared these findings to assess journalists' knowledge of threats, and whether their protective mechanisms would be effective against experts' depictions and predictions of IoT threats. Our results indicate that journalists generally are unaware of IoT-related risks and are not adequately protecting themselves; this considers cases where they possess IoT devices, or where they enter IoT-enabled environments (e.g., at work or home). Expert recommendations spanned both immediate and long-term mitigation methods, including practical actions that are technical and socio-political in nature. However, all proposed individual mitigation methods are likely to be short-term solutions, with 26 of 34 (76.5%) of cyber security experts responding that within the next five years it will not be possible for the public to opt-out of interaction with the IoT.

HCMar 10, 2020

Further Exploring Communal Technology Use in Smart Homes: Social Expectations

Martin J. Kraemer, Ulrik Lyngs, Helena Webb et al.

Device use in smart homes is becoming increasingly communal, requiring cohabitants to navigate a complex social and technological context. In this paper, we report findings from an exploratory survey grounded in our prior work on communal technology use in the home [4]. The findings highlight the importance of considering qualities of social relationships and technology in understanding expectations and intentions of communal technology use. We propose a design perspective of social expectations, and we suggest existing designs can be expanded using already available information such as location, and considering additional information, such as levels of trust and reliability.

HCJun 17, 2019

Informing The Future of Data Protection in Smart Homes

Martin J Kraemer, William Seymour, Reuben Binns et al.

Recent changes to data protection regulation, particularly in Europe, are changing the design landscape for smart devices, requiring new design techniques to ensure that devices are able to adequately protect users' data. A particularly interesting space in which to explore and address these challenges is the smart home, which presents a multitude of difficult social and technical problems in an intimate and highly private context. This position paper outlines the motivation and research approach of a new project aiming to inform the future of data protection by design and by default in smart homes through a combination of ethnography and speculative design.