Generalising sequence models for epigenome predictions with tissue and assay embeddings
This work addresses a bottleneck in regulatory genomics: it proposes a new method that enables more accurate in silico predictions across diverse experimental conditions, a known limitation of existing sequence models.
The paper tackles the poor use of contextual information in sequence models for epigenetic profile prediction, which limits inference on many tissue and assay pairs. By integrating tissue and assay embeddings into a Contextualised Genomic Network, it demonstrates strong correlation across experimental conditions and exceeds the state of the art in multiple settings while using a more rigorous validation procedure.
Sequence modelling approaches for epigenetic profile prediction have recently expanded in terms of sequence length, model size, and profile diversity. However, current models cannot make predictions for many experimentally feasible tissue and assay pairs because they make poor use of contextual information, limiting $\textit{in silico}$ understanding of regulatory genomics. We demonstrate that strong correlation can be achieved across a large range of experimental conditions by integrating tissue and assay embeddings into a Contextualised Genomic Network (CGN). In contrast to previous approaches, we enhance long-range sequence embeddings with contextual information in the input space, rather than expanding the output space. We show the efficacy of our approach across a broad set of epigenetic profiles and provide the first insights into the effect of genetic variants on epigenetic sequence model training. Our general approach to context integration exceeds the state of the art in multiple settings while employing a more rigorous validation procedure.
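The distinction between input-space and output-space context can be illustrated with a minimal sketch. It is not the CGN architecture, which the abstract does not specify. The table sizes, the additive combination of embeddings, the `tanh` layer, and the names `tissue_table`, `assay_table`, `head_weights`, and `predict_track` are all assumptions made for illustration. The point it shows is that a single shared output head, conditioned on tissue and assay embeddings, can be queried for any tissue and assay pair, whereas an output-space design would need one output channel per pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_TISSUES, N_ASSAYS = 5, 3
SEQ_BINS, D_MODEL = 8, 16

# In a trained model these would be learned; here they are random stand-ins.
tissue_table = rng.normal(size=(N_TISSUES, D_MODEL))
assay_table = rng.normal(size=(N_ASSAYS, D_MODEL))
head_weights = rng.normal(size=(D_MODEL,))


def predict_track(seq_embedding, tissue_id, assay_id):
    """Return one predicted value per sequence bin for a (tissue, assay) pair.

    Context enters in the input space: the tissue and assay embeddings are
    summed and added to every bin of the sequence embedding (an assumed
    combination rule), then a single shared head maps each bin to a scalar.
    """
    context = tissue_table[tissue_id] + assay_table[assay_id]  # (D_MODEL,)
    conditioned = seq_embedding + context  # broadcasts over the bin axis
    hidden = np.tanh(conditioned)  # (SEQ_BINS, D_MODEL)
    return hidden @ head_weights  # (SEQ_BINS,)


# Stand-in for the output of a long-range sequence encoder.
seq_embedding = rng.normal(size=(SEQ_BINS, D_MODEL))

# Any pair can be queried through the same head, including pairs that
# would have no dedicated channel in an output-space design.
track = predict_track(seq_embedding, tissue_id=2, assay_id=1)
```

Under these assumptions the context parameters grow with the number of tissues plus the number of assays, while a head with one output channel per pair grows with their product.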