DS · Apr 19

Optimal Phylogenetic Reconstruction from Sampled Quartets

Dionysis Arvanitakis, Vaggos Chatziafratis, Yiyuan Luo, Konstantin Makarychev

arXiv:2604.17461 · 9.4 · h-index: 12

Predicted impact top 81% in DS · last 90 days · Originality: Highly original

AI Analysis

This work solves the open problem of sample complexity and efficient reconstruction for phylogenetic tree learning from quartets, offering optimal results for a fundamental problem in computational biology.

The paper gives an algorithm that reconstructs a phylogenetic tree on n taxa from a random sample of Θ(n) noisy quartets, matching the Ω(n) information-theoretic lower bound. The algorithm achieves a (1-ε)-approximation to the maximum number of sampled quartets a tree can agree with, and it recovers a tree close to the ground truth in quartet distance.

Quartet Reconstruction, the task of recovering a phylogenetic tree from smaller trees on four species called \textit{quartets}, is a well-studied problem in theoretical computer science with far-reaching connections to statistics, graph theory and biology. Given a random sample containing $m$ noisy quartets, labeled by an unknown ground-truth tree $T$ on $n$ taxa, we want to output a tree $\widehat T$ that is \textit{close} to $T$ in terms of quartet distance and can predict unseen quartets. Unfortunately, the empirical risk minimizer corresponds to the $\mathsf{NP}$-hard problem of finding a tree that maximizes agreements with the sampled quartets, and earlier work on approximation algorithms gave $(1-\varepsilon)$-approximation schemes (PTAS) only for dense instances with $m=\Theta(n^4)$ quartets, or for $m=\Theta(n^2\log n)$ quartets \textit{randomly} sampled from $T$. Prior to our work, it was unknown how many samples are information-theoretically required to learn the tree, and whether there is an efficient reconstruction algorithm. We present optimal results for reconstructing an unknown phylogenetic tree $T$ from a random sample of $m=\Theta(n)$ quartets, corrupted under the Random Classification Noise (RCN) model. This matches the $\Omega(n)$ lower bound required for any meaningful tree reconstruction. Our contribution is twofold: first, we give a tree reconstruction algorithm that not only achieves a $(1-\varepsilon)$-approximation but, most importantly, \textit{recovers} a tree close to $T$ in quartet distance; second, we show a new $\Theta(n)$ bound on the Natarajan dimension of phylogenies (an analog of VC dimension in multiclass classification). Our analysis relies on a new \textit{Quartet-based Embedding and Detection} procedure that identifies and removes well-clustered subtrees from the (unknown) ground-truth $T$ via semidefinite programming.
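To make the problem setup concrete, the following Python sketch illustrates the objects the abstract refers to: the quartet topology a tree induces on four taxa, a random sample of quartets corrupted by label noise, the agreement objective, and the quartet distance. It is not the paper's algorithm and does not involve the semidefinite programming step. Everything in it is an assumption made for illustration: trees are binary and written as nested tuples, all function names are hypothetical, the noise rule (with probability `eta`, replace the true label by one of the two other topologies chosen uniformly) is one common reading of RCN rather than the paper's stated definition, and the quartet distance is normalized by the number of quartets and computed by brute force.

```python
import random
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path (tuple of child indices)."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def path_distance(p, q):
    """Number of edges between two leaves, given their root-to-leaf paths."""
    k = 0
    while k < min(len(p), len(q)) and p[k] == q[k]:
        k += 1
    return len(p) + len(q) - 2 * k


def quartet_label(paths, a, b, c, d):
    """Topology induced on {a, b, c, d}: 0 = ab|cd, 1 = ac|bd, 2 = ad|bc.

    Uses the four-point condition: in a binary tree, the true split has the
    strictly smallest sum of within-pair distances.
    """
    sums = [
        path_distance(paths[a], paths[b]) + path_distance(paths[c], paths[d]),
        path_distance(paths[a], paths[c]) + path_distance(paths[b], paths[d]),
        path_distance(paths[a], paths[d]) + path_distance(paths[b], paths[c]),
    ]
    return sums.index(min(sums))


def sample_noisy_quartets(tree, m, eta, rng):
    """Draw m uniformly random quartets; flip each label with probability eta.

    Assumed noise rule: a flipped label is uniform over the two wrong topologies.
    """
    paths = leaf_paths(tree)
    taxa = sorted(paths)
    samples = []
    for _ in range(m):
        quartet = tuple(sorted(rng.sample(taxa, 4)))
        label = quartet_label(paths, *quartet)
        if rng.random() < eta:
            label = rng.choice([x for x in range(3) if x != label])
        samples.append((quartet, label))
    return samples


def agreement(tree, samples):
    """Fraction of sampled quartets whose label matches the tree's topology."""
    paths = leaf_paths(tree)
    hits = sum(quartet_label(paths, *q) == y for q, y in samples)
    return hits / len(samples)


def quartet_distance(tree_a, tree_b):
    """Fraction of all 4-subsets of taxa on which two trees disagree (brute force)."""
    pa, pb = leaf_paths(tree_a), leaf_paths(tree_b)
    quartets = list(combinations(sorted(pa), 4))
    diff = sum(quartet_label(pa, *q) != quartet_label(pb, *q) for q in quartets)
    return diff / len(quartets)


if __name__ == "__main__":
    truth = (((0, 1), (2, 3)), ((4, 5), (6, 7)))      # hypothetical ground truth T
    candidate = (((0, 2), (1, 3)), ((4, 5), (6, 7)))  # hypothetical estimate
    data = sample_noisy_quartets(truth, m=40, eta=0.1, rng=random.Random(0))
    print("agreement of T with the sample:", agreement(truth, data))
    print("agreement of candidate with the sample:", agreement(candidate, data))
    print("quartet distance:", quartet_distance(truth, candidate))
```

In this toy setting the empirical risk minimizer is the tree that maximizes `agreement` on the sample, which is the NP-hard objective mentioned above; `quartet_distance` is the quantity in which the recovered tree is required to be close to $T$.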
