Neural network facilitated ab initio derivation of linear formula: A case study on formulating the relationship between DNA motifs and gene expression
This work addresses the need for interpretable models in biology to uncover formulas representing biological laws, though it is incremental as it builds on existing neural network methods.
The authors tackled the problem of deriving interpretable linear formulas from biological data. They propose a framework that uses an interpretable neural network to predict gene expression from promoter sequences; it achieves performance comparable to deep neural networks and identifies 300 motifs with regulatory roles across 154 cell types.
Developing models with high interpretability, and even deriving formulas that quantify relationships in biological data, is an emerging need. Here we propose a framework for ab initio derivation of sequence motifs and a linear formula, using a new approach based on an interpretable neural network model called the contextual regression model. We showed that this linear model could predict gene expression levels from promoter sequences with performance comparable to that of deep neural network models. We uncovered a list of 300 motifs with important regulatory roles in gene expression and showed that they also contributed significantly to cell-type-specific gene expression in 154 diverse cell types. This work illustrates the possibility of deriving formulas to represent biological laws that may not be easily elucidated. (https://github.com/Wang-lab-UCSD/Motif_Finding_Contextual_Regression)
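The abstract does not spell out the form of the linear formula, so the following is only a minimal sketch of what a linear relationship between motifs and expression could look like. It assumes that each motif is a position weight matrix, that a promoter is summarised by its best match score per motif, and that expression is a weighted sum of those scores plus an intercept. The fit uses ordinary least squares as a stand-in, not the contextual regression model, and every name (`one_hot`, `motif_features`, `fit_linear_formula`), sequence, and coefficient below is hypothetical toy material rather than anything taken from the paper or its repository.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix; unknown bases stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def motif_features(seq, motifs):
    """Return the best match score of each motif over all windows of the sequence."""
    x = one_hot(seq)
    features = []
    for motif in motifs:
        k = motif.shape[0]
        # Slide the motif along the sequence and score each window.
        scores = [np.sum(x[i:i + k] * motif) for i in range(len(seq) - k + 1)]
        features.append(max(scores))
    return np.array(features)


def fit_linear_formula(X, y):
    """Least-squares fit of y ~ X @ weights + intercept (assumed stand-in fit)."""
    design = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:-1], float(coef[-1])


# Hypothetical toy motifs: exact-match matrices for two 4-mers.
motifs = [one_hot("TATA"), one_hot("GGCC")]

# Hypothetical toy promoters, far shorter than real ones.
promoters = ["ACTATAGC", "AGGCCTTT", "CCCCCCCC", "TATAGGCC"]
X = np.array([motif_features(seq, motifs) for seq in promoters])

# Synthetic expression values generated from a known formula, so the fit can be checked.
true_weights = np.array([2.0, -1.0])
true_intercept = 0.5
y = X @ true_weights + true_intercept

weights, intercept = fit_linear_formula(X, y)
print("motif weights:", weights, "intercept:", intercept)
```

In a formula of this shape, the sign and magnitude of each weight can be read directly as the estimated contribution of a motif to expression, which is the kind of interpretability the abstract refers to.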