Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii
This work addresses the challenge of efficiently analyzing genomic data for understanding freeze tolerance in a specific fish species, offering a faster and more accurate alternative to traditional methods, though it is incremental in applying existing ML techniques to a new biological dataset.
The study tackled the problem of identifying genes linked to frost tolerance in Perccottus glenii by applying machine learning to genomic data, achieving a classification accuracy of 99.98% with a Random Forest model. This demonstrates that machine learning can effectively replace traditional manual analysis for this task.
Analysis of the genome sequence of Perccottus glenii, the only fish known to possess freeze tolerance, holds significant importance for understanding how organisms adapt to extreme environments, Traditional biological analysis methods are time-consuming and have limited accuracy, To address these issues, we will employ machine learning techniques to analyze the gene sequences of Perccottus glenii, with Neodontobutis hainanens as a comparative group, Firstly, we have proposed five gene sequence vectorization methods and a method for handling ultra-long gene sequences, We conducted a comparative study on the three vectorization methods: ordinal encoding, One-Hot encoding, and K-mer encoding, to identify the optimal encoding method, Secondly, we constructed four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree, The dataset used by these classification models was extracted from the National Center for Biotechnology Information database, and we vectorized the sequence matrices using the optimal encoding method, K-mer, The Random Forest model, which is the optimal model, achieved a classification accuracy of up to 99, 98 , Lastly, we utilized SHAP values to conduct an interpretable analysis of the optimal classification model, Through ten-fold cross-validation and the AUC metric, we identified the top 10 features that contribute the most to the model's classification accuracy, This demonstrates that machine learning methods can effectively replace traditional manual analysis in identifying genes associated with the freeze tolerance phenotype in Perccottus glenii.