BMLGJun 21, 2024

CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes

arXiv:2406.15669v327 citationsHas Code
Originality Synthesis-oriented
AI Analysis

This work addresses a gap in bioinformatics by providing a standardized benchmark for researchers developing ML methods for enzyme function prediction, though it is incremental as it formalizes existing tasks rather than introducing new paradigms.

The authors tackled the lack of standardized benchmarks for evaluating machine learning methods that predict enzyme function from sequence by introducing CARE, a benchmark suite for enzyme classification and retrieval, which includes tasks for EC number classification and retrieval with designed train-test splits for out-of-distribution generalization and provides baselines such as CREEP for the retrieval task.

Enzymes are important proteins that catalyze chemical reactions. In recent years, machine learning methods have emerged to predict enzyme function from sequence; however, there are no standardized benchmarks to evaluate these methods. We introduce CARE, a benchmark and dataset suite for the Classification And Retrieval of Enzymes (CARE). CARE centers on two tasks: (1) classification of a protein sequence by its enzyme commission (EC) number and (2) retrieval of an EC number given a chemical reaction. For each task, we design train-test splits to evaluate different kinds of out-of-distribution generalization that are relevant to real use cases. For the classification task, we provide baselines for state-of-the-art methods. Because the retrieval task has not been previously formalized, we propose a method called Contrastive Reaction-EnzymE Pretraining (CREEP) as one of the first baselines for this task and compare it to the recent method, CLIPZyme. CARE is available at https://github.com/jsunn-y/CARE/.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes