GN LG QM Sep 26, 2019

Deep Learning and Random Forest-Based Augmentation of sRNA Expression Profiles

Jelena Fiosina, Maksims Fiosins, Stefan Bonn

arXiv:1909.11943v1, 6 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for improved data interoperability and reusability in RNA expression research by providing an automatic augmentation method. It is incremental in that it applies existing machine learning techniques to a domain-specific problem.

The paper tackles the problem of automatically generating accurate annotations for small RNA-seq expression data, which often lack structured labels, by formulating it as a classification task solved with deep learning and random forest methods. The reported average accuracies are 98% for tissue groups, 96.5% for tissues, and 77% for sex. Deep learning outperforms random forest, especially on unseen datasets (83% versus 59% for tissue group prediction).

The lack of well-structured annotations in a growing amount of RNA expression data complicates data interoperability and reusability. Commonly used text mining methods extract annotations from existing unstructured data descriptions and often produce inaccurate output that requires manual curation. Automatic data-based augmentation (generating annotations from the expression data itself) can considerably improve annotation quality but has not been well studied. We formulate automatic augmentation of small RNA-seq expression data as a classification problem and investigate deep learning (DL) and random forest (RF) approaches to solve it. We generate tissue and sex annotations from small RNA-seq expression data for tissues and cell lines of Homo sapiens. We validate our approach on 4243 annotated small RNA-seq samples from the Small RNA Expression Atlas (SEA) database. With DL, the average prediction accuracy is 98% for tissue groups, 96.5% for tissues, and 77% for sex. The "one dataset out" average accuracy for tissue group prediction is 83% (DL) and 59% (RF). On average, DL outperforms RF and considerably improves classification performance on "unseen" datasets.
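The "one dataset out" evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the term means holding out all samples of one source dataset, training on the rest, and averaging the per-dataset accuracies; the abstract does not spell this out. The nearest-centroid classifier is a stand-in chosen to keep the example dependency-free, not the paper's DL or RF model, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def leave_one_dataset_out(samples):
    """Yield (held_out_id, train, test), one split per dataset.

    `samples` is a list of (dataset_id, features, label) tuples.
    """
    by_dataset = defaultdict(list)
    for dataset_id, features, label in samples:
        by_dataset[dataset_id].append((features, label))
    for held_out in sorted(by_dataset):
        train = [s for d, items in by_dataset.items()
                 if d != held_out for s in items]
        yield held_out, train, by_dataset[held_out]


def fit_nearest_centroid(train):
    """Stand-in classifier (not the paper's model): mean vector per label."""
    sums, counts = {}, defaultdict(int)
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [a + b for a, b in zip(sums[label], features)]
        counts[label] += 1
    return {label: [v / counts[label] for v in vec]
            for label, vec in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def one_dataset_out_accuracy(samples):
    """Average the held-out accuracy over every dataset."""
    accuracies = []
    for _, train, test in leave_one_dataset_out(samples):
        centroids = fit_nearest_centroid(train)
        correct = sum(predict(centroids, f) == y for f, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Made-up toy data: (dataset_id, expression_vector, tissue_group)
toy_samples = [
    ("ds1", [9.0, 1.0], "brain"), ("ds1", [1.0, 8.0], "blood"),
    ("ds2", [8.5, 1.5], "brain"), ("ds2", [0.5, 9.0], "blood"),
    ("ds3", [9.5, 0.5], "brain"), ("ds3", [1.5, 7.5], "blood"),
]
mean_accuracy = one_dataset_out_accuracy(toy_samples)
```

Splitting by dataset rather than by sample keeps every sample from the held-out source out of training. That is presumably why the reported "one dataset out" accuracies are lower than the overall averages.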
