CLNov 22, 2022
On Narrative Information and the Distillation of StoriesDylan R. Ashley, Vincent Herrmann, Zachary Friggstad et al.
The act of telling stories is a fundamental part of what it means to be human. This work introduces the concept of narrative information, which we define to be the overlap in information space between a story and the items that compose the story. Using contrastive learning methods, we show how modern artificial neural networks can be leveraged to distill stories and extract a representation of the narrative information. We then demonstrate how evolutionary algorithms can leverage this to extract a set of narrative templates and how these templates -- in tandem with a novel curve-fitting algorithm we introduce -- can reorder music albums to automatically induce stories in them. In the process of doing so, we give strong statistical evidence that these narrative information templates are present in existing albums. While we experiment only with music albums here, the premises of our work extend to any form of (largely) independent media.
DSMar 20
Approximating Traveling Salesman Problems Using a Bridge LemmaMartin Böhm, Zachary Friggstad, Tobias Mömke et al.
We give improved approximations for two metric Traveling Salesman Problem (TSP) variants. In Ordered TSP (OTSP) we are given a linear ordering on a subset of nodes $o_1, \ldots, o_k$. The TSP solution must have that $o_{i+1}$ is visited at some point after $o_i$ for each $1 \leq i < k$. This is the special case of Precedence-Constrained TSP ($PTSP$) in which the precedence constraints are given by a single chain on a subset of nodes. In $k$-Person TSP Path (k-TSPP), we are given pairs of nodes $(s_1,t_1), \ldots, (s_k,t_k)$. The goal is to find an $s_i$-$t_i$ path with minimum total cost such that every node is visited by at least one path. We obtain a $3/2 + e^{-1} < 1.878$ approximation for OTSP, the first improvement over a trivial $α+1$ approximation where $α$ is the current best TSP approximation. We also obtain a $1 + 2 \cdot e^{-1/2} < 2.214$ approximation for k-TSPP, the first improvement over a trivial $3$-approximation. These algorithms both use an adaptation of the Bridge Lemma that was initially used to obtain improved Steiner Tree approximations [Byrka et al., 2013]. Roughly speaking, our variant states that the cost of a cheapest forest rooted at a given set of terminal nodes will decrease by a substantial amount if we randomly sample a set of non-terminal nodes to also become terminals such provided each non-terminal has a constant probability of being sampled. We believe this view of the Bridge Lemma will find further use for improved vehicle routing approximations beyond this paper.
CLNov 3, 2021Code
Automatic Embedding of Stories Into Collections of Independent MediaDylan R. Ashley, Vincent Herrmann, Zachary Friggstad et al.
We look at how machine learning techniques that derive properties of items in a collection of independent media can be used to automatically embed stories into such collections. To do so, we use models that extract the tempo of songs to make a music playlist follow a narrative arc. Our work specifies an open-source tool that uses pre-trained neural network models to extract the global tempo of a set of raw audio files and applies these measures to create a narrative-following playlist. This tool is available at https://github.com/dylanashley/playlist-story-builder/releases/tag/v1.0.0
DSMay 16, 2024
A Polynomial-Time Approximation for Pairwise Fair $k$-Median ClusteringSayan Bandyapadhyay, Eden Chlamtáč, Zachary Friggstad et al.
In this work, we study pairwise fair clustering with $\ell \ge 2$ groups, where for every cluster $C$ and every group $i \in [\ell]$, the number of points in $C$ from group $i$ must be at most $t$ times the number of points in $C$ from any other group $j \in [\ell]$, for a given integer $t$. To the best of our knowledge, only bi-criteria approximation and exponential-time algorithms follow for this problem from the prior work on fair clustering problems when $\ell > 2$. In our work, focusing on the $\ell > 2$ case, we design the first polynomial-time $O(k^2\cdot \ell \cdot t)$-approximation for this problem with $k$-median cost that does not violate the fairness constraints. We complement our algorithmic result by providing hardness of approximation results, which show that our problem even when $\ell=2$ is almost as hard as the popular uniform capacitated $k$-median, for which no polynomial-time algorithm with an approximation factor of $o(\log k)$ is known.
LGOct 14, 2025
Budget-constrained Active Learning to Effectively De-censor Survival DataAli Parsaee, Bei Jiang, Zachary Friggstad et al.
Standard supervised learners attempt to learn a model from a labeled dataset. Given a small set of labeled instances, and a pool of unlabeled instances, a budgeted learner can use its given budget to pay to acquire the labels of some unlabeled instances, which it can then use to produce a model. Here, we explore budgeted learning in the context of survival datasets, which include (right) censored instances, where we know only a lower bound on an instance's time-to-event. Here, that learner can pay to (partially) label a censored instance -- e.g., to acquire the actual time for an instance [perhaps go from (3 yr, censored) to (7.2 yr, uncensored)], or other variants [e.g., learn about one more year, so go from (3 yr, censored) to either (4 yr, censored) or perhaps (3.2 yr, uncensored)]. This serves as a model of real world data collection, where follow-up with censored patients does not always lead to uncensoring, and how much information is given to the learner model during data collection is a function of the budget and the nature of the data itself. We provide both experimental and theoretical results for how to apply state-of-the-art budgeted learning algorithms to survival data and the respective limitations that exist in doing so. Our approach provides bounds and time complexity asymptotically equivalent to the standard active learning method BatchBALD. Moreover, empirical analysis on several survival tasks show that our model performs better than other potential approaches on several benchmarks.
DSMar 29, 2016
Local Search Yields a PTAS for k-Means in Doubling MetricsZachary Friggstad, Mohsen Rezapour, Mohammad R. Salavatipour
The most well known and ubiquitous clustering problem encountered in nearly every branch of science is undoubtedly $k$-means: given a set of data points and a parameter $k$, select $k$ centres and partition the data points into $k$ clusters around these centres so that the sum of squares of distances of the points to their cluster centre is minimized. Typically these data points lie $\mathbb{R}^d$ for some $d\geq 2$. $k$-means and the first algorithms for it were introduced in the 1950's. Since then, hundreds of papers have studied this problem and many algorithms have been proposed for it. The most commonly used algorithm is known as Lloyd-Forgy, which is also referred to as "the" $k$-means algorithm, and various extensions of it often work very well in practice. However, they may produce solutions whose cost is arbitrarily large compared to the optimum solution. Kanungo et al. [2004] analyzed a simple local search heuristic to get a polynomial-time algorithm with approximation ratio $9+ε$ for any fixed $ε>0$ for $k$-means in Euclidean space. Finding an algorithm with a better approximation guarantee has remained one of the biggest open questions in this area, in particular whether one can get a true PTAS for fixed dimension Euclidean space. We settle this problem by showing that a simple local search algorithm provides a PTAS for $k$-means in $\mathbb{R}^d$ for any fixed $d$. More precisely, for any error parameter $ε>0$, the local search algorithm that considers swaps of up to $ρ=d^{O(d)}\cdotε^{-O(d/ε)}$ centres at a time finds a solution using exactly $k$ centres whose cost is at most a $(1+ε)$-factor greater than the optimum. Finally, we provide the first demonstration that local search yields a PTAS for the uncapacitated facility location problem and $k$-median with non-uniform opening costs in doubling metrics.