KinForm: Kinetics Informed Feature Optimised Representation Models for Enzyme $k_{cat}$ and $K_{M}$ Prediction
This work addresses the challenge of limited experimental data for enzyme kinetics prediction, offering a method that enhances accuracy and generalization for researchers in computational biology and enzyme engineering, though it is incremental in nature.
The paper tackles the problem of predicting enzyme kinetic parameters (k_cat and K_M) by introducing KinForm, a framework that optimizes protein feature representations through weighted pooling of residue-level embeddings, dimensionality reduction, and data rebalancing, resulting in improved predictive accuracy and generalization, especially for low sequence similarity proteins, as demonstrated on two benchmark datasets.
Kinetic parameters such as the turnover number ($k_{cat}$) and Michaelis constant ($K_{\mathrm{M}}$) are essential for modelling enzymatic activity but experimental data remains limited in scale and diversity. Previous methods for predicting enzyme kinetics typically use mean-pooled residue embeddings from a single protein language model to represent the protein. We present KinForm, a machine learning framework designed to improve predictive accuracy and generalisation for kinetic parameters by optimising protein feature representations. KinForm combines several residue-level embeddings (Evolutionary Scale Modeling Cambrian, Evolutionary Scale Modeling 2, and ProtT5-XL-UniRef50), taken from empirically selected intermediate transformer layers and applies weighted pooling based on per-residue binding-site probability. To counter the resulting high dimensionality, we apply dimensionality reduction using principal--component analysis (PCA) on concatenated protein features, and rebalance the training data via a similarity-based oversampling strategy. KinForm outperforms baseline methods on two benchmark datasets. Improvements are most pronounced in low sequence similarity bins. We observe improvements from binding-site probability pooling, intermediate-layer selection, PCA, and oversampling of low-identity proteins. We also find that removing sequence overlap between folds provides a more realistic evaluation of generalisation and should be the standard over random splitting when benchmarking kinetic prediction models.