LG | CR | Jul 11, 2025

Exploiting Leaderboards for Large-Scale Distribution of Malicious Models

arXiv:2507.08983v1 | 5 citations | h-index: 11
Originality: Highly original
AI Analysis

The paper exposes a critical vulnerability in the ML ecosystem and calls for leaderboard evaluations to be redesigned to detect malicious models, with broad security implications for the community.

The paper tackles the problem of adversaries using model leaderboards to distribute poisoned models at scale, demonstrating that TrojanClimb can embed harmful functionalities like backdoors while achieving high rankings across text-embedding, text-generation, text-to-speech, and text-to-image modalities.

While poisoning attacks on machine learning models have been extensively studied, the mechanisms by which adversaries can distribute poisoned models at scale remain largely unexplored. In this paper, we shed light on how model leaderboards -- ranked platforms for model discovery and evaluation -- can serve as a powerful channel for adversaries for stealthy large-scale distribution of poisoned models. We present TrojanClimb, a general framework that enables injection of malicious behaviors while maintaining competitive leaderboard performance. We demonstrate its effectiveness across four diverse modalities: text-embedding, text-generation, text-to-speech and text-to-image, showing that adversaries can successfully achieve high leaderboard rankings while embedding arbitrary harmful functionalities, from backdoors to bias injection. Our findings reveal a significant vulnerability in the machine learning ecosystem, highlighting the urgent need to redesign leaderboard evaluation mechanisms to detect and filter malicious (e.g., poisoned) models, while exposing broader security implications for the machine learning community regarding the risks of adopting models from unverified sources.

