AIFeb 10, 2025
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI EvaluationMaria Eriksson, Erasmo Purificato, Arman Noroozian et al.
Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too does concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.
CRAug 22, 2017
Herding Vulnerable Cats: A Statistical Approach to Disentangle Joint Responsibility for Web Security in Shared HostingSamaneh Tajalizadehkhoob, Tom van Goethem, Maciej Korczyński et al.
Hosting providers play a key role in fighting web compromise, but their ability to prevent abuse is constrained by the security practices of their own customers. {\em Shared} hosting, offers a unique perspective since customers operate under restricted privileges and providers retain more control over configurations. We present the first empirical analysis of the distribution of web security features and software patching practices in shared hosting providers, the influence of providers on these security practices, and their impact on web compromise rates. We construct provider-level features on the global market for shared hosting -- containing 1,259 providers -- by gathering indicators from 442,684 domains. Exploratory factor analysis of 15 indicators identifies four main latent factors that capture security efforts: content security, webmaster security, web infrastructure security and web application security. We confirm, via a fixed-effect regression model, that providers exert significant influence over the latter two factors, which are both related to the software stack in their hosting environment. Finally, by means of GLM regression analysis of these factors on phishing and malware abuse, we show that the four security and software patching factors explain between 10\% and 19\% of the variance in abuse at providers, after controlling for size. For web-application security for instance, we found that when a provider moves from the bottom 10\% to the best-performing 10\%, it would experience 4 times fewer phishing incidents. We show that providers have influence over patch levels--even higher in the stack, where CMSes can run as client-side software--and that this influence is tied to a substantial reduction in abuse levels.
CRDec 12, 2016
Developing Security Reputation Metrics for Hosting ProvidersArman Noroozian, Maciej Korczyński, Samaneh TajalizadehKhoob et al.
Research into cybercrime often points to concentrations of abuse at certain hosting providers. The implication is that these providers are worse in terms of security; some are considered `bad' or even `bullet proof'. Remarkably little work exists on systematically comparing the security performance of providers. Existing metrics typically count instances of abuse and sometimes normalize these counts by taking into account the advertised address space of the provider. None of these attempts have worked through the serious methodological challenges that plague metric design. In this paper we present a systematic approach for metrics development and identify the main challenges: (i) identification of providers, (ii) abuse data coverage and quality, (iii) normalization, (iv) aggregation and (v) metric interpretation. We describe a pragmatic approach to deal with these challenges. In the process, we answer an urgent question posed to us by the Dutch police: `which are the worst providers in our jurisdiction?'. Notwithstanding their limitations, there is a clear need for security metrics for hosting providers in the fight against cybercrime.