Omri Ben-Eliezer

9papers

354citations

Novelty62%

AI Score29

Ranked #150,779 of 201,326 authors (top 75%)#480 in DS (top 85%)

9 Papers

LGJan 24, 2022

Active Learning Polynomial Threshold Functions

Omri Ben-Eliezer, Max Hopkins, Chutong Yang et al.

We initiate the study of active learning polynomial threshold functions (PTFs). While traditional lower bounds imply that even univariate quadratics cannot be non-trivially actively learned, we show that allowing the learner basic access to the derivatives of the underlying classifier circumvents this issue and leads to a computationally efficient algorithm for active learning degree-$d$ univariate PTFs in $\tilde{O}(d^3\log(1/\varepsilonδ))$ queries. We also provide near-optimal algorithms and analyses for active learning PTFs in several average case settings. Finally, we prove that access to derivatives is insufficient for active learning multivariate PTFs, even those of just two variables.

DSJan 10, 2022

Bounded Space Differentially Private Quantiles

Daniel Alabi, Omri Ben-Eliezer, Anamay Chaturvedi

Estimating the quantiles of a large dataset is a fundamental problem in both the streaming algorithms literature and the differential privacy literature. However, all existing private mechanisms for distribution-independent quantile computation require space at least linear in the input size $n$. In this work, we devise a differentially private algorithm for the quantile estimation problem, with strongly sublinear space complexity, in the one-shot and continual observation settings. Our basic mechanism estimates any $α$-approximate quantile of a length-$n$ stream over a data universe $\mathcal{X}$ with probability $1-β$ using $O\left( \frac{\log (|\mathcal{X}|/β) \log (αεn)}{αε} \right)$ space while satisfying $ε$-differential privacy at a single time point. Our approach builds upon deterministic streaming algorithms for non-private quantile estimation instantiating the exponential mechanism using a utility function defined on sketch items, while (privately) sampling from intervals defined by the sketch. We also present another algorithm based on histograms that is especially suited to the multiple quantiles case. We implement our algorithms and experimentally evaluate them on synthetic and real-world datasets.

AIOct 19, 2021

What is Learned in Knowledge Graph Embeddings?

Michael R. Douglas, Michael Simkin, Omri Ben-Eliezer et al.

A knowledge graph (KG) is a data structure which represents entities and relations as the vertices and edges of a directed graph with edge types. KGs are an important primitive in modern machine learning and artificial intelligence. Embedding-based models, such as the seminal TransE [Bordes et al., 2013] and the recent PairRE [Chao et al., 2020] are among the most popular and successful approaches for representing KGs and inferring missing edges (link completion). Their relative success is often credited in the literature to their ability to learn logical rules between the relations. In this work, we investigate whether learning rules between relations is indeed what drives the performance of embedding-based methods. We define motif learning and two alternative mechanisms, network learning (based only on the connectivity of the KG, ignoring the relation types), and unstructured statistical learning (ignoring the connectivity of the graph). Using experiments on synthetic KGs, we show that KG models can learn motifs and how this ability is degraded by non-motif (noise) edges. We propose tests to distinguish the contributions of the three mechanisms to performance, and apply them to popular KG benchmarks. We also discuss an issue with the standard performance testing protocol and suggest an improvement. To appear in the proceedings of Complex Networks 2021.

CVMar 18, 2021

Learning Multimodal Affinities for Textual Editing in Images

Or Perel, Oron Anschel, Omri Ben-Eliezer et al.

Nowadays, as cameras are rapidly adopted in our daily routine, images of documents are becoming both abundant and prevalent. Unlike natural images that capture physical objects, document-images contain a significant amount of text with critical semantics and complicated layouts. In this work, we devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image, considering their visual style, the content of their underlying text and their geometric context within the image. We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups. The core of our approach is a deep optimization scheme dedicated for an image provided by the user that detects and leverages reliable pairwise connections in the multimodal representation of the textual elements in order to properly learn the affinities. We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations manipulating the content, appearance and geometry of the image.

LGJan 22, 2021

Adversarial Laws of Large Numbers and Optimal Regret in Online Classification

Noga Alon, Omri Ben-Eliezer, Yuval Dagan et al.

Laws of large numbers guarantee that given a large enough sample from some population, the measure of any fixed sub-population is well-estimated by its frequency in the sample. We study laws of large numbers in sampling processes that can affect the environment they are acting upon and interact with it. Specifically, we consider the sequential sampling model proposed by Ben-Eliezer and Yogev (2020), and characterize the classes which admit a uniform law of large numbers in this model: these are exactly the classes that are \emph{online learnable}. Our characterization may be interpreted as an online analogue to the equivalence between learnability and uniform convergence in statistical (PAC) learning. The sample-complexity bounds we obtain are tight for many parameter regimes, and as an application, we determine the optimal regret bounds in online learning, stated in terms of \emph{Littlestone's dimension}, thus resolving the main open question from Ben-David, Pál, and Shalev-Shwartz (2009), which was also posed by Rakhlin, Sridharan, and Tewari (2015).

DSMar 31, 2020

A Framework for Adversarially Robust Streaming Algorithms

Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff et al.

We investigate the adversarial robustness of streaming algorithms. In this context, an algorithm is considered robust if its performance guarantees hold even if the stream is chosen adaptively by an adversary that observes the outputs of the algorithm along the stream and can react in an online manner. While deterministic streaming algorithms are inherently robust, many central problems in the streaming literature do not admit sublinear-space deterministic algorithms; on the other hand, classical space-efficient randomized algorithms for these problems are generally not adversarially robust. This raises the natural question of whether there exist efficient adversarially robust (randomized) streaming algorithms for these problems. In this work, we show that the answer is positive for various important streaming problems in the insertion-only model, including distinct elements and more generally $F_p$-estimation, $F_p$-heavy hitters, entropy estimation, and others. For all of these problems, we develop adversarially robust $(1+\varepsilon)$-approximation algorithms whose required space matches that of the best known non-robust algorithms up to a $\text{poly}(\log n, 1/\varepsilon)$ multiplicative factor (and in some cases even up to a constant factor). Towards this end, we develop several generic tools allowing one to efficiently transform a non-robust streaming algorithm into a robust one in various scenarios.

CVSep 1, 2019

READ: Recursive Autoencoders for Document Layout Generation

Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel et al.

Layout is a fundamental component of any graphic design. Creating large varieties of plausible document layouts can be a tedious task, requiring numerous constraints to be satisfied, including local ones relating different semantic elements and global constraints on the general appearance and spacing. In this paper, we present a novel framework, coined READ, for REcursive Autoencoders for Document layout generation, to generate plausible 2D layouts of documents in large quantities and varieties. First, we devise an exploratory recursive method to extract a structural decomposition of a single document. Leveraging a dataset of documents annotated with labeled bounding boxes, our recursive neural network learns to map the structural representation, given in the form of a simple hierarchy, to a compact code, the space of which is approximated by a Gaussian distribution. Novel hierarchies can be sampled from this space, obtaining new document layouts. Moreover, we introduce a combinatorial metric to measure structural similarity among document layouts. We deploy it to show that our method is able to generate highly variable and realistic layouts. We further demonstrate the utility of our generated layouts in the context of standard detection tasks on documents, showing that detection performance improves when the training data is augmented with generated documents whose layouts are produced by READ.

DSJun 26, 2019

The Adversarial Robustness of Sampling

Omri Ben-Eliezer, Eylon Yogev

Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet "representative" subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $Ω(d / \varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U,R)$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is \emph{negative}: We demonstrate a set system where a constant sample size (corresponding to VC-dimension $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is "theoretical only", requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that to make Bernoulli or reservoir sampling robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |R|$. This nearly matches the bound imposed by the attack.

DSJul 13, 2016

Deleting and Testing Forbidden Patterns in Multi-Dimensional Arrays

Omri Ben-Eliezer, Simon Korman, Daniel Reichman

Understanding the local behaviour of structured multi-dimensional data is a fundamental problem in various areas of computer science. As the amount of data is often huge, it is desirable to obtain sublinear time algorithms, and specifically property testers, to understand local properties of the data. We focus on the natural local problem of testing pattern freeness: given a large $d$-dimensional array $A$ and a fixed $d$-dimensional pattern $P$ over a finite alphabet, we say that $A$ is $P$-free if it does not contain a copy of the forbidden pattern $P$ as a consecutive subarray. The distance of $A$ to $P$-freeness is the fraction of entries of $A$ that need to be modified to make it $P$-free. For any $ε\in [0,1]$ and any large enough pattern $P$ over any alphabet, other than a very small set of exceptional patterns, we design a tolerant tester that distinguishes between the case that the distance is at least $ε$ and the case that it is at most $a_d ε$, with query complexity and running time $c_d ε^{-1}$, where $a_d < 1$ and $c_d$ depend only on $d$. To analyze the testers we establish several combinatorial results, including the following $d$-dimensional modification lemma, which might be of independent interest: for any large enough pattern $P$ over any alphabet (excluding a small set of exceptional patterns for the binary case), and any array $A$ containing a copy of $P$, one can delete this copy by modifying one of its locations without creating new $P$-copies in $A$. Our results address an open question of Fischer and Newman, who asked whether there exist efficient testers for properties related to tight substructures in multi-dimensional structured data. They serve as a first step towards a general understanding of local properties of multi-dimensional arrays, as any such property can be characterized by a fixed family of forbidden patterns.