MLLGMay 8, 2018

Efficient online learning for large-scale peptide identification

arXiv:1805.03006v1Has Code
Originality Incremental advance
AI Analysis

This work addresses efficiency and accuracy issues in peptide identification for bioinformatics researchers, though it is incremental as it builds on existing ranking methods.

The paper tackles the challenge of improving peptide identification accuracy on large-scale and hard datasets with high false positive rates by introducing an online learning method called OLCS-Ranker, which reduces false discovery rates and is 15-85 times faster than existing methods.

Motivation: Post-database searching is a key procedure in peptide dentification with tandem mass spectrometry (MS/MS) strategies for refining peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, the challenge remains on large-scale datasets and datasets with an extremely large proportion of false positives (hard datasets). A more efficient learning strategy is required for improving the performance of peptide identification on challenging datasets. Results: In this work, we present an online learning method to conquer the challenges remained for exiting peptide identification algorithms. We propose a cost-sensitive learning model by using different loss functions for decoy and target PSMs respectively. A larger penalty for wrongly selecting decoy PSMs than that for target PSMs, and thus the new model can reduce its false discovery rate on hard datasets. Also, we design an online learning algorithm, OLCS-Ranker, to solve the proposed learning model. Rather than taking all training data samples all at once, OLCS-Ranker iteratively feeds in only one training sample into the learning model at each round. As a result, the memory requirement is significantly reduced for large-scale problems. Experimental studies show that OLCS-Ranker outperforms benchmark methods, such as CRanker and Batch-CS-Ranker, in terms of accuracy and stability. Furthermore, OLCS-Ranker is 15--85 times faster than CRanker method on large datasets. Availability and implementation: OLCS-Ranker software is available at no charge for non-commercial use at https://github.com/Isaac-QiXing/CRanker.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes