Predicting Anti-microbial Resistance using Large Language Models
This work addresses the challenge of predicting antimicrobial resistance, which is crucial for public health. Its contribution is incremental, however, as it builds on existing language models with hybrid methods.
The study tackled the problem of classifying antibiotic resistance genes by using both nucleotide sequence and text language models, achieving better performance than a nucleotide sequence-only model in drug resistance class prediction.
With antibiotic resistance rising and infectious diseases such as COVID-19 spreading, classifying genes related to antibiotic resistance is increasingly important. Advances in transformer-based language models for natural language processing have been followed by many language models that learn the characteristics of nucleotide sequences, and these models perform well at classifying various features of nucleotide sequences. In practice, however, classifying a nucleotide sequence draws on background knowledge as well as the sequence itself. In this study, we therefore combine a nucleotide sequence language model with a text language model based on PubMed articles, so that the model reflects more biological background knowledge. We propose a method to fine-tune both models on various databases of antibiotic resistance genes, an LLM-based augmentation technique to supplement the data, an ensemble method to combine the two models, and a benchmark for evaluating the model. Our method achieved better performance than the nucleotide sequence language model alone in drug resistance class prediction.
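The abstract does not specify how the two models are combined, so the following is only a minimal sketch of one common option: a weighted average of the class probabilities produced by each model (soft voting). The function names, the `seq_weight` parameter, and the drug class labels used in the usage note are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(seq_logits, text_logits, class_names, seq_weight=0.5):
    """Combine two classifiers by weighted averaging of class probabilities.

    seq_logits  -- scores from a nucleotide sequence model (assumed input)
    text_logits -- scores from a text model (assumed input)
    seq_weight  -- hypothetical mixing weight in [0, 1] for the sequence model
    """
    if not (len(seq_logits) == len(text_logits) == len(class_names)):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= seq_weight <= 1.0:
        raise ValueError("seq_weight must lie in [0, 1]")

    p_seq = softmax(seq_logits)
    p_text = softmax(text_logits)

    # Weighted average of the two probability distributions.
    combined = [
        seq_weight * a + (1.0 - seq_weight) * b
        for a, b in zip(p_seq, p_text)
    ]

    # Predicted class is the one with the highest combined probability.
    best = max(range(len(combined)), key=combined.__getitem__)
    return class_names[best], combined
```

As a usage example, calling `ensemble_predict([2.0, 0.0, 0.0], [0.0, 3.0, 0.0], ["beta-lactam", "tetracycline", "aminoglycoside"])` averages the two distributions with equal weight. In a setup like this, `seq_weight` would typically be tuned on a held-out validation split. Whether the authors use averaging, stacking, or another scheme is not stated in the abstract.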