David Burstein

CL · Jun 2, 2025

Leveraging Natural Language Processing to Unravel the Mystery of Life: A Review of NLP Approaches in Genomics, Transcriptomics, and Proteomics

Ella Rannon, David Burstein

Natural Language Processing (NLP) has transformed various fields beyond linguistics by applying techniques originally developed for human language to the analysis of biological sequences. This review explores the application of NLP methods to biological sequence data, focusing on genomics, transcriptomics, and proteomics. We examine how various NLP methods, from classic approaches like word2vec to advanced models employing transformers and hyena operators, are being adapted to analyze DNA, RNA, protein sequences, and entire genomes. The review also examines tokenization strategies and model architectures, evaluating their strengths, limitations, and suitability for different biological tasks. We further cover recent advances in NLP applications for biological data, such as structure prediction, gene expression, and evolutionary analysis, highlighting the potential of these methods for extracting meaningful insights from large-scale genomic data. As language models continue to advance, their integration into bioinformatics holds immense promise for advancing our understanding of biological processes in all domains of life.
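As an illustration of what a tokenization strategy for biological sequences can look like, here is a minimal sketch of overlapping k-mer tokenization, one common way to turn a DNA string into "words". It is not taken from the review: the function names `kmer_tokenize` and `build_vocab`, the default `k=3`, and the example sequence are all assumptions chosen for illustration.

```python
def kmer_tokenize(sequence, k=3, stride=1):
    """Split a sequence into k-mers (hypothetical helper, not from the review).

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing characters that do not fill a whole k-mer are dropped.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens):
    """Map each distinct token to an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


# Illustrative input only; any DNA, RNA, or protein string works the same way.
tokens = kmer_tokenize("ATGCGTA", k=3)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]
```

The resulting integer ids are the kind of input an embedding layer or a word2vec-style model would consume. The choice of `k` and `stride` trades vocabulary size against sequence length.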

CR · Jul 18, 2015

Interception in Distance-Vector Routing Networks

David Burstein, Franklin Kenter, Jeremy Kun et al.

Despite the large effort devoted to cybersecurity research over the last decades, cyber intrusions and attacks are still increasing. With respect to routing networks, route hijacking has highlighted the need to reexamine the existing protocols that govern traffic routing. In particular, our primary question is how the topology of a network affects the susceptibility of a routing protocol to endogenous route misdirection. In this paper we define and analyze an abstract model of traffic interception (i.e., eavesdropping) in distance-vector routing networks. Specifically, we study algorithms that measure the potential of groups of dishonest agents to divert traffic through their infrastructure under the constraint that messages must reach their intended destinations. We relate two variants of our model based on the allowed kinds of lies, define strategies for colluding agents, and prove optimality in special cases. In our main theorem we derive a provably optimal monitoring strategy for subsets of agents in which no two are adjacent, and we extend this strategy to the general case. Finally, we use our results to analyze the susceptibility of real and synthetic networks to endogenous traffic interception. In the Autonomous Systems (AS) graph of the United States, we show that compromising only 18 random nodes in the AS graph surprisingly captures 10% of all traffic paths in the network in expectation when a distance-vector routing protocol is in use.
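To make the measured quantity concrete, the sketch below computes the fraction of source-destination paths that cross a set of compromised nodes when every node routes honestly by hop count. This is only a baseline under stated assumptions, not the paper's model: it includes no lying or colluding agents, it assumes an unweighted undirected graph with deterministic tie-breaking, and the names `next_hops` and `intercepted_fraction` are hypothetical.

```python
from collections import deque


def next_hops(graph, dest):
    """Next hop toward dest for every node that can reach it.

    Uses breadth-first search from dest, which matches what a converged
    hop-count distance-vector protocol would pick on an unweighted graph.
    Ties are broken by sorted neighbor order (an assumption).
    """
    hop = {dest: None}
    queue = deque([dest])
    while queue:
        u = queue.popleft()
        for v in sorted(graph[u]):
            if v not in hop:
                hop[v] = u  # v forwards traffic for dest to u
                queue.append(v)
    return hop


def intercepted_fraction(graph, compromised):
    """Fraction of ordered (source, destination) pairs whose route passes
    through a compromised node as an intermediate hop (endpoints excluded)."""
    compromised = set(compromised)
    total = hit = 0
    for dest in graph:
        hop = next_hops(graph, dest)
        for src in graph:
            if src == dest or src not in hop:
                continue  # skip trivial and unreachable pairs
            total += 1
            node = hop[src]
            while node != dest:
                if node in compromised:
                    hit += 1
                    break
                node = hop[node]
    return hit / total if total else 0.0


# Illustrative path graph a - b - c - d with node b compromised.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
fraction = intercepted_fraction(path_graph, {"b"})
```

Averaging `intercepted_fraction` over random choices of the compromised set would give an expectation in the spirit of the figure quoted above, although the paper's result also accounts for dishonest route advertisements that this baseline omits.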
