Rahmad Akbar

BM
h-index 30
5 papers
91 citations
Novelty 39%
AI Score 32

5 Papers

QM Jul 3, 2022
Linguistically inspired roadmap for building biologically reliable protein language models

Mai Ha Vu, Rahmad Akbar, Philippe A. Robert et al.

Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.

QM Sep 26, 2022
ImmunoLingo: Linguistics-based formalization of the antibody language

Mai Ha Vu, Philippe A. Robert, Rahmad Akbar et al.

Apparent parallels between natural language and biological sequences have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, the lack of a rigorous linguistic formalization of biological sequence languages, which would define basic components such as lexicon (i.e., the discrete units of the language) and grammar (i.e., the rules that link sequence well-formedness, structure, and meaning), has led to largely domain-unspecific applications of LMs, which do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as the ones underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish a foundation not only for the application of linguistic tools in adaptive immune receptor analysis but also for the systematic immunolinguistic study of immune receptor specificity in general.

BM Dec 2, 2022
Progress and Challenges for the Application of Machine Learning for Neglected Tropical Diseases

Chung Yuen Khew, Rahmad Akbar, Norfarhan Mohd. Assaad

Neglected tropical diseases (NTDs) continue to affect the livelihoods of individuals in countries in the Southeast Asia and Western Pacific regions. These diseases are long-standing and have caused devastating health problems and economic decline for people in low- and middle-income (developing) countries. An estimated 1.7 billion people worldwide suffer from one or more NTDs annually, which puts approximately one in five individuals at risk for NTDs. In addition to their health and social impact, NTDs inflict a significant financial burden on patients and their close relatives and are responsible for billions of dollars in lost revenue from reduced labor productivity in developing countries alone. There is an urgent need to improve the control and the eradication or elimination of NTDs. This can be achieved by using machine learning tools to improve surveillance, prediction, and detection programs and to combat NTDs through the discovery of new therapeutics against the underlying pathogens. This review surveys the current applications of machine learning tools for NTDs and the challenges in advancing the state of the art in NTD surveillance, management, and treatment.

LG Jun 29, 2025
Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data

Robert Frank, Michael Widrich, Rahmad Akbar et al.

Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.
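
The abstract names Integrated Gradients as the basis of GAMA but does not describe GAMA itself. The following is only a minimal sketch of plain Integrated Gradients on a toy logistic model, not the authors' method. The model, its weights (`W`, `B`), the all-zero baseline, and every function name are assumptions made for illustration. Attributions are computed by averaging the gradient along the straight path from the baseline to the input (midpoint Riemann sum) and scaling by the input difference.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    # Toy differentiable scorer (hypothetical stand-in for a trained network).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    # Analytic gradient of the toy model with respect to each input feature.
    s = model(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    # Average the gradient along the straight line from baseline to x
    # (midpoint rule), then scale by (x - baseline) per feature.
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]

# Illustrative values only (assumptions, not taken from the paper).
W = [1.5, -2.0, 0.5]
B = 0.1
x = [1.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, W, B)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to approximately `model(x, W, B) - model(baseline, W, B)`, and a feature that does not differ from the baseline receives zero attribution.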

BM Jan 29, 2022
AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation

Asif Khan, Alexander I. Cowen-Rivers, Antoine Grosnit et al.

Antibodies are canonically Y-shaped multimeric proteins capable of highly specific molecular recognition. The CDRH3 region located at the tip of variable chains of an antibody dominates antigen-binding specificity. Therefore, it is a priority to design optimal antigen-specific CDRH3 regions to develop therapeutic antibodies. However, the combinatorial nature of CDRH3 sequence space makes it impossible to search for an optimal binding sequence exhaustively and efficiently using computational approaches. Here, we present AntBO: a combinatorial Bayesian optimisation framework enabling efficient in silico design of the CDRH3 region. Ideally, antibodies are expected to have high target specificity and developability. We introduce a CDRH3 trust region that restricts the search to sequences with favourable developability scores to achieve this goal. For benchmarking, AntBO uses the Absolut! software suite as a black-box oracle to score the target specificity and affinity of designed antibodies in silico in an unconstrained fashion (Robert et al., 2021). The experiments performed for 159 discretised antigens used in Absolut! demonstrate the benefit of AntBO in designing CDRH3 regions with diverse biophysical properties. In under 200 calls to the black-box oracle, AntBO can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s and a commonly used genetic algorithm baseline. Additionally, AntBO finds very-high-affinity CDRH3 sequences in only 38 protein designs whilst requiring no domain knowledge. We conclude that AntBO brings automated antibody design methods closer to what is practically viable for in vitro experimentation.
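
To illustrate the idea of restricting a sequence search to a trust region of developable candidates under a fixed oracle budget, here is a deliberately simplified sketch. It is not AntBO: it replaces Bayesian optimisation with random local search (no surrogate model or acquisition function), and the developability filter, the toy oracle, the hidden target string, the Hamming radius, and all function names are hypothetical placeholders rather than anything from the paper or from Absolut!.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(ca != cb for ca, cb in zip(a, b))

def is_developable(seq, max_run=5):
    # Hypothetical filter: reject sequences in which any residue
    # repeats more than max_run times in a row.
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return True

def toy_oracle(seq):
    # Stand-in for a black-box binding oracle (lower is better).
    # The target string is arbitrary and purely illustrative.
    return hamming(seq, "ARDYWGQGTLV")

def propose(center, radius, rng):
    # Mutate between 1 and `radius` positions, so the candidate stays
    # within Hamming distance `radius` of the current best sequence.
    seq = list(center)
    for pos in rng.sample(range(len(seq)), rng.randint(1, radius)):
        seq[pos] = rng.choice(AMINO_ACIDS)
    return "".join(seq)

def trust_region_search(start, oracle, budget=200, radius=3, seed=0):
    rng = random.Random(seed)
    best, best_score = start, oracle(start)
    calls = 1
    while calls < budget:
        candidate = propose(best, radius, rng)
        if not is_developable(candidate):
            continue  # filtered candidates do not consume oracle calls
        score = oracle(candidate)
        calls += 1
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score, calls

start = "ACDEFGHIKLM"
best, best_score, calls = trust_region_search(start, toy_oracle)
print(best, best_score, calls)
```

The design point this sketch shares with the abstract is only structural: candidates are drawn from a neighbourhood of the incumbent, screened for developability before the expensive oracle is queried, and the number of oracle calls is capped.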