Yiyuan Luo

h-index12

3papers

Novelty75%

AI Score48

Ranked #51,899 of 205,806 authors (top 25%)#145 in DS (top 25%)

3 Papers

29.8DSApr 19

Optimal Phylogenetic Reconstruction from Sampled Quartets

Dionysis Arvanitakis, Vaggos Chatziafratis, Yiyuan Luo et al.

Quartet Reconstruction, the task of recovering a phylogenetic tree from smaller trees on four species called \textit{quartets}, is a well-studied problem in theoretical computer science with far-reaching connections to statistics, graph theory and biology. Given a random sample containing $m$ noisy quartets, labeled by an unknown ground-truth tree $T$ on $n$ taxa, we want to output a tree $\widehat T$ that is \textit{close} to $T$ in terms of quartet distance and can predict unseen quartets. Unfortunately, the empirical risk minimizer corresponds to the $\mathsf{NP}$-hard problem of finding a tree that maximizes agreements with the sampled quartets, and earlier works in approximation algorithms gave $(1-\eps)$-approximation schemes (PTAS) for dense instances with $m=Θ(n^4)$ quartets, or for $m=Θ(n^2\log n)$ quartets \textit{randomly} sampled from $T$. Prior to our work, it was unknown how many samples are information-theoretically required to learn the tree, and whether there is an efficient reconstruction algorithm. We present optimal results for reconstructing an unknown phylogenetic tree $T$ from a random sample of $m=Θ(n)$ quartets, corrupted under the Random Classification Noise (RCN) model. This matches the $Ω(n)$ lower bound required for any meaningful tree reconstruction. Our contribution is twofold: first, we give a tree reconstruction algorithm that, not only achieves a $(1-\eps)$-approximation, but most importantly \textit{recovers} a tree close to $T$ in quartet distance; second, we show a new $Θ(n)$ bound on the Natarajan dimension of phylogenies (an analog of VC dimension in multiclass classification). Our analysis relies on a new \textit{Quartet-based Embedding and Detection} procedure that identifies and removes well-clustered subtrees from the (unknown) ground-truth $T$ via semidefinite programming.

40.8DSMay 5

Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch

Dionysis Arvanitakis, Vaggos Chatziafratis, Yiyuan Luo

Embedding-based representations in Euclidean space $\mathbb{R}^d$ are a cornerstone of modern machine learning, where a major goal is to use the \emph{smallest dimension} that faithfully captures data relations. In this work, we prove sharp dimension--accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension $d$ is chosen close to the ground-truth dimension $D$, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of $m$ anchor--positive--negative triplets $(i,j,k)$ encoding distance comparisons $\mathrm{dist}(i,j) < \mathrm{dist}(i,k)$. Specifically, given triplets realizable by an unknown ground-truth embedding in $D$ dimensions, we prove that there exists constant $c < 1$, such that \emph{every embedding of dimension at most $cD$ violates half of the triplets}, yielding accuracy as low as a trivial one-dimensional solution that ignores the input. We complement our information-theoretic bounds with strong computational hardness results: under the Unique Games Conjecture, even if the given triplets are nearly realizable in $D=1$ dimension, no polynomial-time algorithm -- \textit{regardless of its dimension} -- can achieve accuracy above the trivial $50\%$ baseline.

LGSep 21, 2025

The Complexity of Finding Local Optima in Contrastive Learning

Jingming Yan, Yiyuan Luo, Vaggos Chatziafratis et al.

Contrastive learning is a powerful technique for discovering meaningful data representations by optimizing objectives based on $\textit{contrastive information}$, often given as a set of weighted triplets $\{(x_i, y_i^+, z_{i}^-)\}_{i = 1}^m$ indicating that an "anchor" $x_i$ is more similar to a "positive" example $y_i$ than to a "negative" example $z_i$. The goal is to find representations (e.g., embeddings in $\mathbb{R}^d$ or a tree metric) where anchors are placed closer to positive than to negative examples. While finding $\textit{global}$ optima of contrastive objectives is $\mathsf{NP}$-hard, the complexity of finding $\textit{local}$ optima -- representations that do not improve by local search algorithms such as gradient-based methods -- remains open. Our work settles the complexity of finding local optima in various contrastive learning problems by proving $\mathsf{PLS}$-hardness in discrete settings (e.g., maximize satisfied triplets) and $\mathsf{CLS}$-hardness in continuous settings (e.g., minimize Triplet Loss), where $\mathsf{PLS}$ (Polynomial Local Search) and $\mathsf{CLS}$ (Continuous Local Search) are well-studied complexity classes capturing local search dynamics in discrete and continuous optimization, respectively. Our results imply that no polynomial time algorithm (local search or otherwise) can find a local optimum for various contrastive learning problems, unless $\mathsf{PLS}\subseteq\mathsf{P}$ (or $\mathsf{CLS}\subseteq \mathsf{P}$ for continuous problems). Even in the unlikely scenario that $\mathsf{PLS}\subseteq\mathsf{P}$ (or $\mathsf{CLS}\subseteq \mathsf{P}$), our reductions imply that there exist instances where local search algorithms need exponential time to reach a local optimum, even for $d=1$ (embeddings on a line).