GN AI LG Oct 11, 2021

Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types

Shentong Mo, Xi Fu, Chenyang Hong, Yizhen Chen, Yuxuan Zheng, Xiangru Tang, Zhiqiang Shen, Eric P Xing, Yanyan Lan

arXiv:2110.05231v26.627 citations

Originality: Highly original

AI Analysis

This work addresses the lack of generalizability in deep learning methods for regulatory genome modeling across cell types, which is important for genome biology research and applications such as disease risk estimation.

The paper tackles the problem of modeling regulatory genome interactions across cell types by proposing GeneBERT, a multi-modal self-supervised pre-training approach that takes 1D sequences and 2D matrices as input. It reports improved performance across different cell types on downstream tasks such as promoter classification and transcription factor binding site prediction.

In genome biology research, regulatory genome modeling underpins many downstream regulatory tasks, such as promoter classification and transcription factor binding site prediction. The core problem is to model how regulatory elements interact with each other and how these interactions vary across cell types. However, current deep learning methods often model genome sequences from a fixed set of cell types and do not account for interactions between multiple regulatory elements, so they perform well only on the cell types in the training set and lack the generalizability required in biological applications. In this work, we propose GeneBERT, a simple yet effective approach for pre-training on genome data in a multi-modal and self-supervised manner. Specifically, we simultaneously take the 1D sequence of genome data and a 2D matrix of (transcription factors × regions) as input, and propose three pre-training tasks to improve the robustness and generalizability of our model. We pre-train our model on an ATAC-seq dataset with 17 million genome sequences. We evaluate GeneBERT on regulatory downstream tasks across different cell types, including promoter classification, transcription factor binding site prediction, disease risk estimation, and splice site prediction. Extensive experiments demonstrate the effectiveness of multi-modal and self-supervised pre-training for large-scale regulatory genomics data.
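
The two input modalities described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the overlapping k-mer tokenization, the binary matrix entries, the helper names `kmer_tokens` and `tf_region_matrix`, and all sequences and transcription factor names are hypothetical and do not come from the paper.

```python
from typing import List, Set, Tuple


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: k-mer tokenization is a common choice for genome
    language models; the abstract does not say which tokenizer is used.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tf_region_matrix(
    tf_names: List[str],
    num_regions: int,
    hits: Set[Tuple[str, int]],
) -> List[List[int]]:
    """Build a (transcription factors x regions) matrix.

    Rows are transcription factors, columns are regions. An entry is 1
    when the (tf_name, region_index) pair is in `hits`, else 0.
    Assumption: binary entries are used here only for illustration.
    """
    return [
        [1 if (tf, j) in hits else 0 for j in range(num_regions)]
        for tf in tf_names
    ]


# Hypothetical toy data: two short regions and three made-up factors.
regions = ["ACGTAC", "TTGACA"]
tf_names = ["TF_A", "TF_B", "TF_C"]
hits = {("TF_A", 0), ("TF_B", 1), ("TF_C", 1)}

# 1D modality: one token list per region.
sequence_input = [kmer_tokens(region) for region in regions]

# 2D modality: one row per factor, one column per region.
matrix_input = tf_region_matrix(tf_names, len(regions), hits)
```

In this sketch, `sequence_input` plays the role of the 1D sequence input and `matrix_input` the role of the 2D (transcription factors × regions) input. How the two are actually encoded and combined, and what the three pre-training tasks are, is not specified in the abstract.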
