LGJun 2, 2023
Evaluating Language Models for Mathematics through InteractionsKatherine M. Collins, Albert Q. Jiang, Simon Frieder et al. · cambridge
There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs, and is insufficient for making an informed decision about which LLMs and under which assistive settings can they be sensibly used. Static assessment fails to account for the essential interactive element in LLM deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty respond well to user corrections, and are more interpretable and concise may constitute better assistants. Interactive evaluation is a promising way to navigate the capability of these models; humans should be aware of language models' algebraic fallibility and discern where they are appropriate to use.
CLFeb 18, 2023
Optimising Human-Machine Collaboration for Efficient High-Precision Information Extraction from Text DocumentsBradley Butcher, Miri Zilka, Darren Cook et al. · cambridge
While humans can extract information from unstructured text with high precision and recall, this is often too time-consuming to be practical. Automated approaches, on the other hand, produce nearly-immediate results, but may not be reliable enough for high-stakes applications where precision is essential. In this work, we consider the benefits and drawbacks of various human-only, human-machine, and machine-only information extraction approaches. We argue for the utility of a human-in-the-loop approach in applications where high precision is required, but purely manual extraction is infeasible. We present a framework and an accompanying tool for information extraction using weak-supervision labelling with human validation. We demonstrate our approach on three criminal justice datasets. We find that the combination of computer speed and human understanding yields precision comparable to manual annotation while requiring only a fraction of time, and significantly outperforms fully automated baselines in terms of precision.
CLSep 25, 2022
Can We Automate the Analysis of Online Child Sexual Exploitation Discourse?Darren Cook, Miri Zilka, Heidi DeSandre et al. · cambridge
Social media's growing popularity raises concerns around children's online safety. Interactions between minors and adults with predatory intentions is a particularly grave concern. Research into online sexual grooming has often relied on domain experts to manually annotate conversations, limiting both scale and scope. In this work, we test how well-automated methods can detect conversational behaviors and replace an expert human annotator. Informed by psychological theories of online grooming, we label $6772$ chat messages sent by child-sex offenders with one of eleven predatory behaviors. We train bag-of-words and natural language inference models to classify each behavior, and show that the best performing models classify behaviors in a manner that is consistent, but not on-par, with human annotation.
HCMay 11
How do people watch AI-generated videos of physical scenes?Danqing Shi, Lan Jiang, Katherine M. Collins et al.
The growing prevalence of realistic AI-generated videos on media platforms increasingly blurs the line between fact and fiction, eroding public trust. Understanding how people watch AI-generated videos offers a human-centered perspective for improving AI detection and guiding advancements in video generation. However, existing studies have not investigated human gaze behavior in response to AI-generated videos of physical scenes. Here, we collect and analyze the eye movements from 40 participants during video understanding and AI detection tasks involving a mix of real-world and AI-generated videos. We find that given the high realism of AI-generated videos, gaze behavior is driven less by the video's actual authenticity and more by the viewer's perception of its authenticity. Our results demonstrate that the mere awareness of potential AI generation may alter media consumption from passive viewing into an active search for anomalies.
CYJul 7, 2023
AI and the EU Digital Markets Act: Addressing the Risks of Bigness in Generative AIAyse Gizem Yasar, Andrew Chong, Evan Dong et al.
As AI technology advances rapidly, concerns over the risks of bigness in digital markets are also growing. The EU's Digital Markets Act (DMA) aims to address these risks. Still, the current framework may not adequately cover generative AI systems that could become gateways for AI-based services. This paper argues for integrating certain AI software as core platform services and classifying certain developers as gatekeepers under the DMA. We also propose an assessment of gatekeeper obligations to ensure they cover generative AI services. As the EU considers generative AI-specific rules and possible DMA amendments, this paper provides insights towards diversity and openness in generative AI services.
CYApr 30, 2024
Reimagining AI in Social Work: Practitioner Perspectives on Incorporating Technology in their PracticeKatie Wassal, Carolyn Ashurst, Jiri Hron et al.
There has been a surge in the number and type of AI tools being tested and deployed within both national and local government in the UK, including within the social care sector. Given the many ongoing and planned future developments, the time is ripe to review and reflect on the state of AI in social care. We do so by conducting semi-structured interviews with UK-based social work professionals about their experiences and opinions of past and current AI systems. Our aim is to understand what systems would practitioners like to see developed and how. We find that all our interviewees had overwhelmingly negative past experiences of technology in social care, unanimous aversion to algorithmic decision systems in particular, but also strong interest in AI applications that could allow them to spend less time on administrative tasks. In response to our findings, we offer a series of concrete recommendations, which include commitment to participatory design, as well as the necessity of regaining practitioner trust.
LGOct 2, 2025
Learning Pareto-Optimal Pandemic Intervention Policies with MORLMarian Chen, Miri Zilka
The COVID-19 pandemic underscored a critical need for intervention strategies that balance disease containment with socioeconomic stability. We approach this challenge by designing a framework for modeling and evaluating disease-spread prevention strategies. Our framework leverages multi-objective reinforcement learning (MORL) - a formulation necessitated by competing objectives - combined with a new stochastic differential equation (SDE) pandemic simulator, calibrated and validated against global COVID-19 data. Our simulator reproduces national-scale pandemic dynamics with orders of magnitude higher fidelity than other models commonly used in reinforcement learning (RL) approaches to pandemic intervention. Training a Pareto-Conditioned Network (PCN) agent on this simulator, we illustrate the direct policy trade-offs between epidemiological control and economic stability for COVID-19. Furthermore, we demonstrate the framework's generality by extending it to pathogens with different epidemiological profiles, such as polio and influenza, and show how these profiles lead the agent to discover fundamentally different intervention policies. To ground our work in contemporary policymaking challenges, we apply the model to measles outbreaks, quantifying how a modest 5% drop in vaccination coverage necessitates significantly more stringent and costly interventions to curb disease spread. This work provides a robust and adaptable framework to support transparent, evidence-based policymaking for mitigating public health crises.