SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model
This work addresses the need for efficient computational prediction of chromatin accessibility, which is crucial for understanding drug-DNA interactions and gene expression. Its contribution is incremental: it builds on existing methods by adding language model features.
The authors tackled the problem of predicting chromatin accessibility by introducing SemanticCAP, a model that integrates features from a gene language model to capture contextual information in gene sequences, achieving better performance than existing systems on public benchmarks.
A large number of inorganic and organic compounds can bind DNA and form complexes, among which drug-related molecules are especially important. Changes in chromatin accessibility not only directly affect drug-DNA interactions, but also promote or inhibit the expression of critical genes associated with drug resistance by affecting the DNA-binding capacity of transcription factors (TFs) and transcriptional regulators. However, biological experimental techniques for measuring chromatin accessibility are expensive and time-consuming. In recent years, several kinds of computational methods have been proposed to identify accessible regions of the genome, but existing models mostly ignore the contextual information of bases in gene sequences. To address these issues, we propose a new solution named SemanticCAP. It introduces a gene language model that models the context of gene sequences and can therefore provide an effective representation of a given site in a gene sequence. We merge the features provided by the gene language model into our chromatin accessibility model, and we design several methods to make this feature fusion smoother. On public benchmarks, our model achieves better performance than other systems.
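The fusion idea described above, combining a per-site sequence encoding with contextual features from a language model, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_context_features` is a hypothetical stand-in for the output of a gene language model (it only counts base frequencies in a small window), and normalizing each feature group before concatenation is one common way to make fusion smoother, not necessarily the method the authors designed.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-dim indicator vector (unknown bases map to all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def toy_context_features(seq, window=2):
    """Hypothetical placeholder for language-model features:
    base frequencies in a window around each site."""
    seq = seq.upper()
    feats = []
    for i in range(len(seq)):
        ctx = seq[max(0, i - window): i + window + 1]
        feats.append([ctx.count(b) / len(ctx) for b in BASES])
    return feats

def layer_norm(vec, eps=1e-6):
    """Shift a vector to zero mean and scale it to (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def fuse(seq):
    """Normalize each feature group separately, then concatenate them per site
    (an assumed fusion scheme, chosen only for illustration)."""
    local = one_hot(seq)
    context = toy_context_features(seq)
    return [layer_norm(l) + layer_norm(c) for l, c in zip(local, context)]

fused = fuse("ACGTAC")
print(len(fused), len(fused[0]))  # number of sites, fused feature width
```

Normalizing the two groups separately keeps either one from dominating the fused vector purely because of its scale, which is the kind of mismatch a fusion step typically has to handle.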