Distance-based species tree estimation: information-theoretic trade-off between number of loci and sequence length under the coalescent
This work addresses efficient species tree estimation for evolutionary biologists by providing theoretical bounds on data requirements. The contribution is incremental in that it builds on existing coalescent and signal detection frameworks.
The paper tackles the reconstruction of a phylogeny from multiple genes under the multispecies coalescent by connecting it to sparse signal detection. From this connection it derives an information-theoretic trade-off: detecting a branch of length $f$ requires $m = \Theta(1/[f^{2} \sqrt{k}])$ genes, where $k$ is the sequence length of each gene.
We consider the reconstruction of a phylogeny from multiple genes under the multispecies coalescent. We establish a connection with the sparse signal detection problem, where one seeks to distinguish between a distribution and a mixture of the distribution and a sparse signal. Using this connection, we derive an information-theoretic trade-off between the number of genes, $m$, needed for an accurate reconstruction and the sequence length, $k$, of the genes. Specifically, we show that to detect a branch of length $f$, one needs $m = \Theta(1/[f^{2} \sqrt{k}])$.
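
To make the stated scaling concrete, the following is a minimal Python sketch of how the required number of genes changes with $f$ and $k$. It is an illustration only: the function name `relative_loci` and the constant `c` are hypothetical, since the $\Theta(\cdot)$ notation hides constants that the abstract does not give, and the sketch assumes the scaling applies at the values used. Only ratios between outputs are meaningful, not the absolute numbers.

```python
import math


def relative_loci(f: float, k: int, c: float = 1.0) -> float:
    """Return c / (f^2 * sqrt(k)), the stated scaling of the number of genes m.

    f: branch length to detect
    k: sequence length per gene
    c: placeholder for the unknown constant hidden by Theta (an assumption)
    """
    if f <= 0 or k <= 0:
        raise ValueError("f and k must be positive")
    return c / (f ** 2 * math.sqrt(k))


# Baseline setting (values chosen arbitrarily for illustration).
base = relative_loci(f=0.01, k=100)

# Quadrupling the sequence length at fixed branch length.
longer_sequences = relative_loci(f=0.01, k=400)

# Halving the branch length at fixed sequence length.
shorter_branch = relative_loci(f=0.005, k=100)

print("ratio when k is quadrupled:", longer_sequences / base)
print("ratio when f is halved:", shorter_branch / base)
```

Under this scaling, quadrupling $k$ roughly halves the number of genes needed, while halving $f$ roughly quadruples it, so longer sequences compensate for fewer genes only at a square-root rate.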