Read classification using semi-supervised deep learning
This work addresses a domain-specific bottleneck in bioinformatics by facilitating genome assembly with reduced manual labeling effort, though it appears incremental as it builds on existing semi-supervised and deep learning techniques.
The paper tackles the problem of detecting specific types of reads that hinder de novo genome assembly. It proposes a semi-supervised deep learning method that analyzes read coverage graphs as 1D signals, and reports improved assembly quality with minimal labeled data.
In this paper, we propose a semi-supervised deep learning method for detecting the specific types of reads that impede de novo genome assembly. Instead of working directly with the sequenced reads, we analyze their coverage graphs converted to 1D signals, having observed that each relevant class of reads exhibits a characteristic signal pattern. We chose a semi-supervised approach because manually labeling the data is slow and tedious, and our goal was to facilitate assembly with as little labeled data as possible. We tested two models for learning patterns in the coverage graphs: M1+M2 and semi-GAN. We evaluated each model on a manually labeled dataset comprising reads from multiple reference genomes, measuring performance as a function of the number of labeled examples used during training. In addition, we embedded our detection in the assembly process, which improved the quality of the assemblies.
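
The abstract does not say how a coverage graph is turned into a 1D signal, so the following is only a minimal sketch under stated assumptions. It assumes that the coverage of a read at a given base is the number of other reads overlapping that base, that overlaps are given as half-open intervals in read coordinates, and that the signal is resampled to a fixed length by linear interpolation so it can be fed to a network. The function name `coverage_signal` and its parameters are hypothetical and do not come from the paper.

```python
import numpy as np


def coverage_signal(read_length, overlaps, signal_length=None):
    """Return the per-base coverage of one read as a 1D float array.

    Assumptions (not taken from the paper): `overlaps` is an iterable of
    half-open intervals (start, end) in read coordinates, one per
    overlapping read; `signal_length`, if given, is the fixed length the
    signal is resampled to.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")

    # Difference array: +1 where an overlap starts, -1 where it ends.
    diff = np.zeros(read_length + 1, dtype=np.int64)
    for start, end in overlaps:
        start = max(0, start)          # clip to the read boundaries
        end = min(read_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # Prefix sum turns the difference array into per-base coverage.
    coverage = np.cumsum(diff[:-1]).astype(np.float64)
    if signal_length is None:
        return coverage

    # Resample to a fixed length with linear interpolation.
    src = np.linspace(0.0, 1.0, num=read_length)
    dst = np.linspace(0.0, 1.0, num=signal_length)
    return np.interp(dst, src, coverage)


if __name__ == "__main__":
    # A 10-base read overlapped by two reads covering [0, 5) and [3, 10).
    print(coverage_signal(10, [(0, 5), (3, 10)]))
```

In this sketch a sharp drop or spike in the returned array is the kind of pattern a classifier could pick up on; which patterns correspond to which read class is described in the paper, not here.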