Stephen Young

LG
h-index13
5papers
33citations
Novelty53%
AI Score36

5 Papers

GNAug 22, 2023
Generalising sequence models for epigenome predictions with tissue and assay embeddings

Jacob Deasy, Ron Schwessinger, Ferran Gonzalez et al.

Sequence modelling approaches for epigenetic profile prediction have recently expanded in terms of sequence length, model size, and profile diversity. However, current models cannot infer on many experimentally feasible tissue and assay pairs due to poor usage of contextual information, limiting $\textit{in silico}$ understanding of regulatory genomics. We demonstrate that strong correlation can be achieved across a large range of experimental conditions by integrating tissue and assay embeddings into a Contextualised Genomic Network (CGN). In contrast to previous approaches, we enhance long-range sequence embeddings with contextual information in the input space, rather than expanding the output space. We exhibit the efficacy of our approach across a broad set of epigenetic profiles and provide the first insights into the effect of genetic variants on epigenetic sequence model training. Our general approach to context integration exceeds state of the art in multiple settings while employing a more rigorous validation procedure.

GNMay 10, 2024
Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction

Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam et al.

Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants, but they still lag experimental accuracy. Here, we present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS) assays using a Normalised Log-odds Ratio (NLR) head. We find consistent improvements in a held-out protein test set, and on independent DMS and clinical variant annotation benchmarks from ProteinGym and ClinVar. These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.

LGOct 2, 2025
Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction

Jie Li, Andrew McCarthy, Zhizhuo Zhang et al.

In-context learners like TabPFN are promising for biomolecule efficacy prediction, where established molecular feature sets and relevant experimental results can serve as powerful contextual examples. However, their performance is highly sensitive to the provided context, making strategies like post-hoc ensembling of models trained on different data subsets a viable approach. An open question is how to select the best models for the ensemble without access to ground truth labels. In this study, we investigate an uncertainty-guided strategy for model selection. We demonstrate on an siRNA knockdown efficacy task that a TabPFN model using straightforward sequence-based features can surpass specialized state-of-the-art predictors. We also show that the model's predicted inter-quantile range (IQR), a measure of its uncertainty, has a negative correlation with true prediction error. We developed the OligoICP method, which selects and averages an ensemble of models with the lowest mean IQR for siRNA efficacy prediction, achieving superior performance compared to naive ensembling or using a single model trained on all available data. This finding highlights model uncertainty as a powerful, label-free heuristic for optimizing biomolecule efficacy predictions.

SPMay 28, 2025
Temporal Convolutional Autoencoder for Interference Mitigation in FMCW Radar Altimeters

Charles E. Thornton, Jamie Sloop, Samuel Brown et al.

We investigate the end-to-end altitude estimation performance of a convolutional autoencoder-based interference mitigation approach for frequency-modulated continuous-wave (FMCW) radar altimeters. Specifically, we show that a Temporal Convolutional Network (TCN) autoencoder effectively exploits temporal correlations in the received signal, providing superior interference suppression compared to a Least Mean Squares (LMS) adaptive filter. Unlike existing approaches, the present method operates directly on the received FMCW signal. Additionally, we identify key challenges in applying deep learning to wideband FMCW interference mitigation and outline directions for future research to enhance real-time feasibility and generalization to arbitrary interference conditions.

LGDec 14, 2021
Epigenomic language models powered by Cerebras

Meredith V. Trotter, Cuong Q. Nguyen, Stephen Young et al.

Large scale self-supervised pre-training of Transformer language models has advanced the field of Natural Language Processing and shown promise in cross-application to the biological `languages' of proteins and DNA. Learning effective representations of DNA sequences using large genomic sequence corpuses may accelerate the development of models of gene regulation and function through transfer learning. However, to accurately model cell type-specific gene regulation and function, it is necessary to consider not only the information contained in DNA nucleotide sequences, which is mostly invariant between cell types, but also how the local chemical and structural `epigenetic state' of chromosomes varies between cell types. Here, we introduce a Bidirectional Encoder Representations from Transformers (BERT) model that learns representations based on both DNA sequence and paired epigenetic state inputs, which we call Epigenomic BERT (or EBERT). We pre-train EBERT with a masked language model objective across the entire human genome and across 127 cell types. Training this complex model with a previously prohibitively large dataset was made possible for the first time by a partnership with Cerebras Systems, whose CS-1 system powered all pre-training experiments. We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task. Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard. We explore how the inclusion of epigenetic data and task specific feature augmentation impact transfer learning performance.