Nadiia Chepurko

DS
5 papers
114 citations
Novelty: 63%
AI Score: 28

5 Papers

CV · Mar 30, 2022
Learning Program Representations for Food Images and Cooking Recipes

Dim P. Papadopoulos, Enrique Mora, Nadiia Chepurko et al.

In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online.

DS · Jul 16, 2021
Near-Optimal Algorithms for Linear Algebra in the Current Matrix Multiplication Time

Nadiia Chepurko, Kenneth L. Clarkson, Praneeth Kacham et al.

In the numerical linear algebra community, it was suggested that to obtain nearly optimal bounds for various problems such as rank computation, finding a maximal linearly independent subset of columns (a basis), regression, or low-rank approximation, a natural way would be to resolve the main open question of Nelson and Nguyen (FOCS, 2013). This question concerns the logarithmic factors in the sketching dimension of existing oblivious subspace embeddings that achieve constant-factor approximation. We show how to bypass this question using a refined sketching technique, and obtain optimal or nearly optimal bounds for these problems. A key technique we use is an explicit mapping of Indyk based on uncertainty principles and extractors, which, after first applying known oblivious subspace embeddings, allows us to quickly spread out the mass of the vector so that sampling is now effective. We thereby avoid a logarithmic factor in the sketching dimension that is standard in bounds proven using the matrix Chernoff inequality. For the fundamental problems of rank computation and finding a basis, our algorithms improve on Cheung, Kwok, and Lau (JACM, 2013), and are optimal to within a constant factor and a poly(log log(n))-factor, respectively. Further, for constant-factor regression and low-rank approximation we give the first optimal algorithms, for the current matrix multiplication exponent.
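For readers unfamiliar with oblivious subspace embeddings, the following is a minimal sketch of one well-known example, CountSketch, applied to a tall matrix. It is not the refined sketching technique of the paper and does not include Indyk's mapping; the function names `countsketch` and `frobenius_sq` are hypothetical and chosen only for illustration.

```python
import random


def countsketch(A, m, seed=0):
    """Return S A, where S is an m x n CountSketch matrix and A is n x d.

    Each input row i is added, with a random sign, to exactly one of the
    m output rows, so the cost is proportional to the number of entries of A.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    SA = [[0.0] * d for _ in range(m)]
    for i in range(n):
        bucket = rng.randrange(m)        # which sketch row receives row i
        sign = rng.choice((-1.0, 1.0))   # random sign attached to row i
        for j in range(d):
            SA[bucket][j] += sign * A[i][j]
    return SA


def frobenius_sq(M):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in M for x in row)
```

The sketch is oblivious in the sense that the buckets and signs are drawn without looking at `A`. How large `m` must be for a given guarantee is exactly the kind of question the abstract discusses, and this example makes no claim about it.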

DS · Nov 9, 2020
Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra

Nadiia Chepurko, Kenneth L. Clarkson, Lior Horesh et al.

We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression that are comparable to their quantum analogues. De-quantizing such algorithms has received a flurry of attention in recent years; we obtain sharper bounds for these problems. More significantly, we achieve these improvements by arguing that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise; these are powerful and standard techniques in randomized numerical linear algebra. With this recognition, we are able to employ the large body of work in numerical linear algebra to obtain algorithms for these problems that are simpler or faster (or both) than existing approaches. Our experiments demonstrate that the proposed data structures also work well on real-world datasets.
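As a rough illustration of the leverage score sampling mentioned above, the sketch below computes exact row leverage scores of a small full-column-rank matrix and samples rescaled rows for a least-squares problem. It is a textbook version written under simplifying assumptions, not the dynamic data structure of the paper, and the names `leverage_scores` and `sample_rows` are hypothetical.

```python
import math
import random


def leverage_scores(A):
    """Row leverage scores of a full-column-rank matrix A (list of rows).

    The score of row i is the squared norm of row i of Q, where the columns
    of Q are an orthonormal basis for the column space of A.
    """
    n, d = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(d)]
    Q = []  # orthonormal columns, built by modified Gram-Schmidt
    for v in cols:
        v = v[:]
        for q in Q:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:
            raise ValueError("columns of A are linearly dependent")
        Q.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in Q) for i in range(n)]


def sample_rows(A, b, m, rng):
    """Sample m rows of (A, b) with probability proportional to leverage.

    Each sampled row is rescaled by 1 / sqrt(m * p_i) so that the sampled
    Gram matrix equals the original one in expectation.
    """
    scores = leverage_scores(A)
    total = sum(scores)
    probs = [s / total for s in scores]
    idx = rng.choices(range(len(A)), weights=probs, k=m)
    SA = [[a / math.sqrt(m * probs[i]) for a in A[i]] for i in idx]
    Sb = [b[i] / math.sqrt(m * probs[i]) for i in idx]
    return SA, Sb
```

Solving the smaller problem on `(SA, Sb)` then stands in for the full regression. Ridge leverage scores follow the same pattern with a regularization term added, which this sketch omits.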

LG · Mar 21, 2020
ARDA: Automatic Relational Data Augmentation for Machine Learning

Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen et al.

Automatic machine learning (AutoML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
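The two components described above (join, then prune) can be pictured with a toy sketch. This is only an assumption-laden illustration: the join is a plain key lookup, the selection rule is a simple correlation threshold standing in for the paper's feature selection algorithm, and the names `left_join`, `pearson`, and `select_features` are hypothetical.

```python
import math


def left_join(base_rows, extra_rows, key, columns, fill=0.0):
    """Attach the given columns of extra_rows to base_rows by matching key.

    Rows of base_rows without a match receive `fill` (a simplification).
    """
    lookup = {row[key]: row for row in extra_rows}
    joined = []
    for row in base_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in columns:
            merged[col] = match.get(col, fill)
        joined.append(merged)
    return joined


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(rows, target, candidates, threshold=0.5):
    """Keep candidate columns whose |correlation| with the target is high."""
    ys = [r[target] for r in rows]
    return [
        c for c in candidates
        if abs(pearson([r[c] for r in rows], ys)) >= threshold
    ]
```

A real system would also have to discover which tables and keys to join on, which is the harder part and is not attempted here.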

DS · Dec 9, 2019
Robust and Sample Optimal Algorithms for PSD Low-Rank Approximation

Ainesh Bakshi, Nadiia Chepurko, David P. Woodruff

Recently, Musco and Woodruff (FOCS, 2017) showed that given an $n \times n$ positive semidefinite (PSD) matrix $A$, it is possible to compute a $(1+ε)$-approximate relative-error low-rank approximation to $A$ by querying $O(nk/ε^{2.5})$ entries of $A$ in time $O(nk/ε^{2.5} +n k^{ω-1}/ε^{2(ω-1)})$. They also showed that any relative-error low-rank approximation algorithm must query $Ω(nk/ε)$ entries of $A$; this gap has since remained open. Our main result is to resolve this question by obtaining an optimal algorithm that queries $O(nk/ε)$ entries of $A$ and outputs a relative-error low-rank approximation in $O(n(k/ε)^{ω-1})$ time. Note that our running time improves on that of Musco and Woodruff, and matches the information-theoretic lower bound if the matrix-multiplication exponent $ω$ is $2$. We then extend our techniques to negative-type distance matrices. Bakshi and Woodruff (NeurIPS, 2018) showed a bi-criteria, relative-error low-rank approximation which queries $O(nk/ε^{2.5})$ entries and outputs a rank-$(k+4)$ matrix. We show that the bi-criteria guarantee is not necessary and obtain an $O(nk/ε)$ query algorithm, which is optimal. Our algorithm applies to all distance matrices that arise from metrics satisfying negative-type inequalities, including $\ell_1, \ell_2,$ spherical metrics and hypermetrics. Next, we introduce a new robust low-rank approximation model which captures PSD matrices that have been corrupted with noise. While a sample complexity lower bound precludes sublinear algorithms for arbitrary PSD matrices, we provide the first sublinear time and query algorithms when the corruption on the diagonal entries is bounded. As a special case, we show sample-optimal sublinear time algorithms for low-rank approximation of correlation matrices corrupted by noise.
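To make the idea of reading only part of a PSD matrix concrete, here is a minimal sketch of a classical baseline, diagonally pivoted Cholesky, which reads the diagonal plus at most $k$ columns (about $n + nk$ entries). It is not the algorithm of the paper and carries no relative-error guarantee; the names `make_oracle` and `pivoted_cholesky` are hypothetical.

```python
import math


def make_oracle(A):
    """Wrap a matrix in an entry oracle that counts how many entries are read."""
    counter = {"queries": 0}

    def entry(i, j):
        counter["queries"] += 1
        return A[i][j]

    return entry, counter


def pivoted_cholesky(entry, n, k, tol=1e-12):
    """Return factor columns L (each of length n) with A approximately L L^T.

    Reads the n diagonal entries once, then one full column per step,
    always pivoting on the largest remaining diagonal entry.
    """
    diag = [entry(i, i) for i in range(n)]  # residual diagonal
    L = []
    for _ in range(k):
        p = max(range(n), key=lambda i: diag[i])
        if diag[p] <= tol:
            break  # residual is numerically zero; stop early
        col = [entry(i, p) for i in range(n)]
        # Subtract the part of column p already explained by earlier factors.
        for prev in L:
            lp = prev[p]
            col = [c - x * lp for c, x in zip(col, prev)]
        piv = math.sqrt(diag[p])
        new = [c / piv for c in col]
        L.append(new)
        diag = [d - x * x for d, x in zip(diag, new)]
    return L
```

The diagonal is a useful guide for a PSD matrix because every off-diagonal entry satisfies $|A_{ij}| \le \sqrt{A_{ii} A_{jj}}$, so large mass cannot hide away from large diagonal entries.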