CL | GN | Dec 10, 2024

Can linguists better understand DNA?

arXiv:2412.07678v3 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses the under-explored problem of cross-domain transfer between natural and biological languages, which is relevant to researchers in computational linguistics and bioinformatics, though its scope is incremental.

This study tackles the transfer of capabilities from natural language to gene sequences by constructing DNA-pair and DNA-protein-pair classification tasks. GPT-2-small, pre-trained on English and fine-tuned on English sentence-pair data, reached 78% accuracy on DNA-pair classification, and a BERT model trained on multilingual text reached 89% precision. On the more complex DNA-protein-pair task, performance was near random.

Multilingual transfer ability, which reflects how well models fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models. However, whether such capability transfer exists between natural language and gene sequences (genetic language) remains underexplored. This study addresses the gap by drawing inspiration from the sentence-pair classification task used to evaluate sentence similarity in natural language. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small-scale pre-trained model like GPT-2-small, which was pre-trained on English, achieved an accuracy of 78% on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X). With a BERT model trained on multilingual text, precision reached 89%. On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random output. Experimental validation confirmed that the transfer of capabilities from natural language to biological language is unequivocally present. Building on this foundation, we also investigated the impact of model parameter scale and pre-training on this capability transfer. We provide recommendations for facilitating the transfer of capabilities from natural language to genetic language, as well as new approaches for conducting biological research based on this capability. This study offers an intriguing new perspective on the relationship between natural language and genetic language.
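To make the shape of the DNA-pair task concrete, here is a minimal Python sketch of a sentence-pair-style dataset over DNA strings, together with the two metrics the abstract reports (accuracy and precision). It is not the authors' dataset or evaluation code: the way pairs are generated (point-mutated copies as positives, unrelated random sequences as negatives), the function names, the sequence length, the mutation rate, and the identity-threshold baseline are all assumptions chosen for illustration.

```python
import random

BASES = "ACGT"


def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_pairs(n, length=60, rate=0.05, seed=0):
    """Build (seq_a, seq_b, label) triples; this construction is hypothetical.

    label 1: seq_b is a lightly mutated copy of seq_a ("similar").
    label 0: seq_b is an unrelated random sequence ("dissimilar").
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(n):
        a = random_dna(length, rng)
        if i % 2 == 0:
            pairs.append((a, mutate(a, rate, rng), 1))
        else:
            pairs.append((a, random_dna(length, rng), 0))
    return pairs


def identity(a, b):
    """Fraction of aligned positions where two equal-length sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else 0.0


# Stand-in classifier: a simple identity threshold, NOT a language model.
pairs = make_pairs(100)
y_true = [label for _, _, label in pairs]
y_pred = [int(identity(a, b) > 0.6) for a, b, _ in pairs]
print("accuracy:", accuracy(y_true, y_pred))
print("precision:", precision(y_true, y_pred))
```

In the study's setting, the threshold baseline would be replaced by a model fine-tuned on natural-language sentence pairs and then applied to the DNA pairs. Note that accuracy and precision measure different things, so the 78% and 89% figures above are not directly comparable.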

