CELG Jan 25, 2014

Identification of Protein Coding Regions in Genomic DNA Using Unsupervised FMACA Based Pattern Classifier

arXiv:1401.6484v1 (17 citations)
Originality: Synthesis-oriented
AI Analysis

This work addresses a domain-specific problem in bioinformatics for identifying coding regions in DNA sequences, with incremental improvements in accuracy and scalability.

The paper tackles the problem of identifying protein-coding regions in genomic DNA by proposing an unsupervised Fuzzy Multiple Attractor Cellular Automata (FMACA) based pattern classifier, which achieves good classification accuracy and scalability for large datasets.

Genes carry the instructions for making the proteins found in a cell, encoded as a specific sequence of nucleotides in DNA molecules. However, the regions of these genes that code for proteins may occupy only a small part of the sequence. Identifying the coding regions plays a vital role in understanding these genes. In this paper we propose an unsupervised Fuzzy Multiple Attractor Cellular Automata (FMACA) based pattern classifier to identify the coding regions of a DNA sequence. We propose a distinct K-Means algorithm for designing the FMACA classifier that is simple and efficient, and that produces a more accurate classifier than has previously been obtained for a range of different sequence lengths. Experimental results confirm the scalability of the proposed unsupervised FMACA based classifier to handle large volumes of data irrespective of the number of classes, tuples, and attributes. Good classification accuracy has been established.

