DNA-GCN: Graph convolutional networks for predicting DNA-protein binding
This work addresses a classic bioinformatics problem of interest to genomics researchers, but its contribution is incremental: it applies an existing graph-based method to a new domain without a major methodological advance.
The authors address DNA-protein binding prediction by proposing DNA-GCN, a graph convolutional network that models sequence data as a k-mer graph; it achieves performance competitive with baseline methods on 50 ENCODE datasets.
Predicting DNA-protein binding is an important and classic problem in bioinformatics. Convolutional neural networks have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. However, no prior study has used graph convolutional networks for motif inference. In this work, we propose to do so. We build a single sequence k-mer graph for the whole dataset, based on k-mer co-occurrence and on the relationship between k-mers and the sequences that contain them, and then learn a DNA Graph Convolutional Network (DNA-GCN) on that graph. DNA-GCN is initialized with a one-hot representation for every node and then jointly learns embeddings for both k-mers and sequences, supervised by the known labels of the sequences. We evaluate our model on 50 datasets from ENCODE, where DNA-GCN achieves performance competitive with the baseline model. In addition, we analyze our model and design several alternative architectures to fit different datasets.
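To make the graph construction described above concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes one node per sequence and one node per distinct k-mer, and it uses raw counts as edge weights because the abstract does not state the weighting scheme. The function names (`build_graph`, `normalize`, `gcn_layer`), the `span` parameter, and the toy sequences are all hypothetical.

```python
import numpy as np


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(seqs, k=3, span=1):
    """Build a symmetric adjacency matrix: sequence nodes first, then k-mer nodes.

    Edge weights are raw counts. This is an assumption made for illustration;
    the source does not specify how edges are weighted.
    """
    vocab = sorted({km for s in seqs for km in kmers(s, k)})
    index = {km: len(seqs) + j for j, km in enumerate(vocab)}
    n = len(seqs) + len(vocab)
    adj = np.zeros((n, n))
    for i, s in enumerate(seqs):
        kms = kmers(s, k)
        for p, km in enumerate(kms):
            # Sequence to k-mer edge: how often the k-mer occurs in the sequence.
            adj[i, index[km]] += 1.0
            adj[index[km], i] += 1.0
            # K-mer to k-mer edge: co-occurrence with the next `span` k-mers.
            for other in kms[p + 1:p + 1 + span]:
                if other != km:
                    adj[index[km], index[other]] += 1.0
                    adj[index[other], index[km]] += 1.0
    return adj, vocab


def normalize(adj):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, x, w):
    """One graph convolution followed by ReLU: max(A_norm X W, 0)."""
    return np.maximum(a_norm @ x @ w, 0.0)


# Toy input (hypothetical sequences, not ENCODE data).
seqs = ["ACGTAC", "TTGACG"]
adj, vocab = build_graph(seqs, k=3)
a_norm = normalize(adj)

# One-hot initialization: the feature matrix is the identity over all nodes.
x = np.eye(adj.shape[0])
rng = np.random.default_rng(0)
w = rng.normal(size=(x.shape[1], 4))  # untrained weights, 4 hidden units

h = gcn_layer(a_norm, x, w)
seq_embeddings = h[:len(seqs)]    # rows for sequence nodes
kmer_embeddings = h[len(seqs):]   # rows for k-mer nodes
```

Because sequences and k-mers share one graph, a single forward pass yields embeddings for both node types. In a trained model the weights would be fit against the sequence labels, with the rows for sequence nodes feeding a classifier; that training loop is omitted here.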