QM AI | May 19

ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

arXiv:2603.06740 | 60.9 | h-index: 3

AI Analysis

Provides the first comprehensive benchmark for evaluating protein language models (pLMs) on viral proteins, addressing a critical gap in proactive pandemic preparedness.

ViroGym benchmarks protein language models on viral mutation prediction across 79 deep mutational scanning (DMS) assays, 21 influenza neutralisation tasks, and a SARS-CoV-2 pandemic prediction task. It finds that the ProGen2 family consistently outperforms the other models, and that DMS and neutralisation performance identifies the models that generalise to real-world emergence.

Protein language models (pLMs) have shown strong potential for zero-shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real-world pandemic prediction task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.
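To make the zero-shot evaluation described above concrete, here is a minimal sketch of how such a benchmark is commonly scored. It rests on several assumptions that the abstract does not confirm: that a variant's score is the log-likelihood difference between mutant and wild type, that agreement with DMS measurements is summarised by Spearman rank correlation, and that overlap between mutation sets is measured with a Jaccard index. The function `toy_log_likelihood`, the table `TOY_LOGP`, the sequences, and the measured values are hypothetical placeholders, not ViroGym data or any real model's API.

```python
import math
from typing import Callable, Iterable, List, Sequence

# Hypothetical per-residue log-probabilities; a real pLM would supply these.
TOY_LOGP = {"A": -1.0, "G": -1.5, "K": -2.0, "W": -4.0}


def toy_log_likelihood(seq: str) -> float:
    """Stand-in for a pLM: sum of position-independent residue log-probs."""
    return sum(TOY_LOGP.get(aa, -3.0) for aa in seq)


def zero_shot_score(log_likelihood: Callable[[str], float],
                    wild_type: str, mutant: str) -> float:
    """Assumed scoring rule: log-likelihood of mutant minus wild type."""
    return log_likelihood(mutant) - log_likelihood(wild_type)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap of two mutation sets: |intersection| / |union|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Illustrative inputs only (not real assay data).
wild_type = "AKGA"
mutants = ["AKGG", "AKGW", "AKAA", "WKGA"]
measured = [0.8, 0.1, 1.0, 0.3]  # hypothetical DMS readouts

scores = [zero_shot_score(toy_log_likelihood, wild_type, m) for m in mutants]
rho = spearman(scores, measured)
print(f"Spearman rho between zero-shot scores and DMS readouts: {rho:.3f}")
```

A rank-based metric is a natural fit here because zero-shot scores and assay readouts live on different scales, so only their ordering is comparable. The `jaccard` helper illustrates one way to quantify the observation that mutation sets from different tasks barely overlap.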
