Matrix Completion and Performance Guarantees for Single Individual Haplotyping
This paper addresses the reconstruction of inherited genetic variations from sequencing data for diploid organisms such as humans; its contribution is an incremental improvement in computational biology.
The paper tackles the NP-hard problem of single individual haplotyping by formulating it as binary matrix factorization and solving it with alternating minimization. It establishes theoretical bounds on the haplotype reconstruction error and outperforms existing methods on synthetic and real-world datasets.
Single individual haplotyping is an NP-hard problem that emerges when attempting to reconstruct an organism's inherited genetic variations using data typically generated by high-throughput DNA sequencing platforms. Genomes of diploid organisms, including humans, are organized into homologous pairs of chromosomes that differ from each other in a relatively small number of variant positions. Haplotypes are ordered sequences of the nucleotides in the variant positions of the chromosomes in a homologous pair; for diploids, haplotypes associated with a pair of chromosomes may be conveniently represented by means of complementary binary sequences. In this paper, we consider a binary matrix factorization formulation of the single individual haplotyping problem and efficiently solve it by means of alternating minimization. We analyze the convergence properties of the alternating minimization algorithm and establish theoretical bounds for the achievable haplotype reconstruction error. The proposed technique is shown to outperform existing methods when applied to synthetic as well as real-world Fosmid-based HapMap NA12878 datasets.
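To make the factorization view concrete, the following is a minimal sketch of alternating minimization for a rank-one binary model, not the paper's exact algorithm. It assumes the read data are arranged in a matrix R (reads by variant positions) with entries +1 or -1 for observed alleles and 0 for positions a read does not cover, and that R is approximated on its observed entries by the outer product of a read-membership vector u and a haplotype vector h, both with entries in {-1, +1}; the second haplotype is then -h. The function names, the SVD-based initialization, and the tie-breaking rule are all assumptions made for illustration.

```python
import numpy as np


def _sign(x):
    # Map real values to {-1, +1}; ties (exact zeros) are broken toward +1.
    return np.where(x >= 0, 1, -1)


def alt_min_haplotype(R, n_iter=20):
    """Alternating minimization sketch for R ~ outer(u, h) on observed entries.

    R : (reads x variant positions) array, +1/-1 where observed, 0 where missing.
    Returns (u, h), both vectors with entries in {-1, +1}.
    """
    R = np.asarray(R, dtype=float)

    # Assumed initialization: signs of the leading right singular vector of R.
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    h = _sign(Vt[0])

    for _ in range(n_iter):
        # With h fixed, the squared error over observed entries is minimized
        # row by row by u_i = sign(sum_j R_ij * h_j).
        u = _sign(R @ h)
        # With u fixed, the same argument gives h_j = sign(sum_i R_ij * u_i).
        h_new = _sign(R.T @ u)
        if np.array_equal(h_new, h):
            break  # reached a fixed point of the alternating updates
        h = h_new

    u = _sign(R @ h)  # make the returned u consistent with the returned h
    return u, h


def reconstruction_error(h_hat, h_true):
    # Fraction of mismatched positions, up to the global sign flip that
    # exchanges the two complementary haplotypes.
    mismatch = float(np.mean(np.asarray(h_hat) != np.asarray(h_true)))
    return min(mismatch, 1.0 - mismatch)
```

Because every entry is +1 or -1, minimizing the squared error over observed entries is equivalent to maximizing the sum of R_ij u_i h_j, which is why each update reduces to a sign of a matrix-vector product. The pairs (u, h) and (-u, -h) fit the data equally well, so the error measure above compares against both.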