Lisa Soder

CY
h-index: 23
3 papers
42 citations
Novelty: 18%
AI Score: 31

3 Papers

CY Nov 15, 2023
Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework

Markus Anderljung, Everett Thornton Smith, Joe O'Brien et al.

With the increasing integration of frontier large language models (LLMs) into society and the economy, decisions related to their training, deployment, and use have far-reaching implications. These decisions should not be left solely in the hands of frontier LLM developers. LLM users, civil society and policymakers need trustworthy sources of information to steer such decisions for the better. Involving outside actors in the evaluation of these systems - what we term 'external scrutiny' - via red-teaming, auditing, and external researcher access, offers a solution. Though there are encouraging signs of increasing external scrutiny of frontier LLMs, its success is not assured. In this paper, we survey six requirements for effective external scrutiny of frontier AI systems and organize them under the ASPIRE framework: Access, Searching attitude, Proportionality to the risks, Independence, Resources, and Expertise. We then illustrate how external scrutiny might function throughout the AI lifecycle and offer recommendations to policymakers.

69.4 CY May 13
Europe and the Geopolitics of AGI: The Need for a Preparedness Plan

Maximilian Negele, Daan Juijn, Afek Shamir et al.

Artificial general intelligence (AGI)--defined here as AI systems that match or exceed humans at most economically useful cognitive work--has moved from speculation to the centre of political and strategic debate. This paper examines three questions: how soon AGI might emerge, how it could reshape geopolitics, and whether Europe is adequately prepared. Drawing on empirical trends in AI capabilities, expert forecasting surveys, and policy analysis, we find that a plausible window for AGI emergence falls between 2030 and 2040, or potentially earlier, though substantial uncertainty remains. Our analysis of the geopolitical implications suggests that AGI could fundamentally alter the global distribution of economic and military power, intensify interstate competition, and strain existing governance frameworks. Assessing Europe's current positioning, we identify critical gaps: limited strategic awareness of frontier AI progress, structural weaknesses in compute infrastructure and talent retention, low rates of industrial AI adoption, and fragmented policy responses at both EU and Member State levels that do not match the potential scale of disruption. These findings point to a need for a coordinated European preparedness agenda. We outline policy options centred on building institutional capacity for AGI situational awareness, strengthening Europe's position in the AI value chain, and developing frameworks for international stability in an era of increasingly capable AI systems.

AI Dec 7, 2024
More than Marketing? On the Information Value of AI Benchmarks for Practitioners

Amelia Hardy, Anka Reuel, Kiana Jafari Meimandi et al.

Public AI benchmark results are widely broadcast by model developers as indicators of model quality within a growing and competitive market. However, these advertised scores do not necessarily reflect the traits of interest to those who will ultimately apply AI models. In this paper, we seek to understand if and how AI benchmarks are used to inform decision-making. Based on the analyses of interviews with 19 individuals who have used, or decided against using, benchmarks in their day-to-day work, we find that across these settings, participants use benchmarks as a signal of relative performance difference between models. However, whether this signal was considered a definitive sign of model superiority, sufficient for downstream decisions, varied. In academia, public benchmarks were generally viewed as suitable measures for capturing research progress. By contrast, in both product and policy, benchmarks -- even those developed internally for specific tasks -- were often found to be inadequate for informing substantive decisions. Of the benchmarks deemed unsatisfactory, respondents reported that their goals were neither well-defined nor reflective of real-world use. Based on the study results, we conclude that effective benchmarks should provide meaningful, real-world evaluations, incorporate domain expertise, and maintain transparency in scope and goals. They must capture diverse, task-relevant capabilities, be challenging enough to avoid quick saturation, and account for trade-offs in model performance rather than relying on a single score. Additionally, proprietary data collection and contamination prevention are critical for producing reliable and actionable results. By adhering to these criteria, benchmarks can move beyond mere marketing tricks into robust evaluative frameworks.