Antoine Amarilli

DSMay 21

Gray Codes With Constant Delay and Constant Auxiliary Space

Antoine Amarilli, Claire David, Nadime Francis et al.

We give the first two algorithms to enumerate all binary words of $\{0,1\}^\ell$ (like Gray codes) while ensuring that the delay and the auxiliary space is independent from $\ell$, i.e., constant time for each word, and constant memory in addition to the $\ell$ bits storing the current word. Our algorithms are given in two new computational models: tape machines and deque machines. We also study more restricted models, queue machines and stack machines, and show that they cannot enumerate all binary words with constant auxiliary space, even with unrestricted delay. A tape machine is a Turing machine that stores the current binary word on a single working tape of length $\ell$ (which never increases), using no other tape. The machine has a single head and must edit its tape to reach all possible words of $\{0,1\}^\ell$, and output them (in unit time, by entering special output states), with no duplicates. Hence a tape machine uses constant auxiliary space by definition (up to the head position). We construct a tape machine that achieves this task with constant delay between consecutive outputs, so that the machine implements a so-called skew-tolerant quasi-Gray code. We then construct a more involved tape machine that implements a Gray code. A deque machine stores the current binary word on a double-ended queue of length $\ell$, and stores a constant-size internal state. It works as a tape machine, except that it modifies the content of the deque by performing push and pop operations on the endpoints. Hence again a deque machine uses constant auxiliary space by definition. We construct deque machines that enumerate all words of $\{0,1\}^\ell$ with constant-delay. The main technical challenge in this model is to correctly detect when enumeration has finished.

FLMay 8

Out-of-Order Membership in Regular Languages

Antoine Amarilli, Sebastien Labbe, Charles Paperman

We introduce the task of out-of-order membership to a formal language L, where the letters of a word w are revealed one by one in an adversarial order. The length |w| is known in advance, but the content of w is streamed as pairs (i, w[i]), received exactly once for each position i, in arbitrary order. We study efficient algorithms for this task when L is regular, seeking tight complexity bounds as a function of |w| for a fixed target language. Most of our results apply to an algebraically defined variant dubbed out-of-order evaluation: this problem is defined for a fixed finite monoid or semigroup S, and our goal is to compute the ordered product of the streamed elements of w. We show that, for any fixed regular language or finite semigroup, both problems can be solved in constant time per streamed symbol and in linear space. However, the precise space complexity strongly depends on the algebraic structure of the target language or evaluation semigroup. Our main contributions are therefore to show (deterministic) space complexity characterizations, which we do for out-of-order evaluation of monoids and semigroups. For monoids, we establish a trichotomy: the space complexity is either Θ(1), Θ(log n), or Θ(n), where n = |w|. More specifically, the problem admits a constant-space solution for commutative monoids, while all non-commutative monoids require Ω(log n) space. We further identify a class of monoids admitting an O(log n)-space algorithm, and show that all remaining monoids require Ω(n) space. For general semigroups, the situation is more intricate. We characterize a class of semigroups admitting constant-space algorithms for out-of-order evaluation, and show that semigroups outside this class require at least Ω(log n) space.

DSMay 5

The S-Hamiltonian Cycle Problem

Antoine Amarilli, Arthur Lombardo, Mikaël Monet

Determining if an input undirected graph is Hamiltonian, i.e., if it has a cycle that visits every vertex exactly once, is one of the most famous NP-complete problems. We consider the following generalization of Hamiltonian cycles: for a fixed set $S$ of natural numbers, we want to visit each vertex of a graph $G$ exactly once and ensure that any two consecutive vertices can be joined in $k$ hops for some choice of $k \in S$. Formally, an $S$-Hamiltonian cycle is a permutation $(v_0,\ldots,v_{n-1})$ of the vertices of $G$ such that, for $0 \leq i \leq n-1$, there exists a walk between $v_i$ and $v_{i+1 \bmod n}$ whose length is in $S$. (We do not impose any constraints on how many times vertices can be visited as intermediate vertices of walks.) Of course Hamiltonian cycles in the standard sense correspond to $S=\{1\}$. We study the $S$-Hamiltonian cycle problem of deciding whether an input graph $G$ has an $S$-Hamiltonian cycle. Our goal is to determine the complexity of this problem depending on the fixed set $S$. It is already known that the problem remains NP-complete for $S=\{1,2\}$, whereas it is trivial for $S=\{1,2,3\}$ because any connected graph contains a $\{1,2,3\}$-Hamiltonian cycle. Our work classifies the complexity of this problem for most kinds of sets $S$, with the key new results being the following: we have NP-completeness for $S = \{2\}$ and for $S = \{2, 4\}$, but tractability for $S = \{1, 2, 4\}$, for $S = \{2, 4, 6\}$, for any superset of these two tractable cases, and for $S$ the infinite set of all odd integers. The remaining open cases are the non-singleton finite sets of odd integers, in particular $S = \{1, 3\}$. Beyond cycles, we also discuss the complexity of finding $S$-Hamiltonian paths, and show that our problems are all tractable on graphs of bounded cliquewidth.

LOMar 21

Tighter Bounds for Query Answering with Guarded TGDs

Antoine Amarilli, Michael Benedikt

We consider the complexity of the open-world query answering problem, where we wish to determine certain answers to conjunctive queries over incomplete datasets specified by an initial set of facts and a set of guarded TGDs. This problem has been well-studied in the literature and is decidable but with a high complexity, namely, it is 2EXPTIME complete. Further, the complexity shrinks by one exponential when the arity is fixed. We show in this paper how we can obtain better complexity bounds when considering separately the arity of the guard atom and that of the additional atoms, called the side signature. Our results make use of the technique of linearizing guarded TGDs, introduced in Gottlob, Manna, and Pieris. Specifically, we present a variant of the linearization process, making use of a restricted version of the chase that we recently introduced. Our results imply that open-world query answering with guarded TGDs can be solved in EXPTIME with arbitrary-arity guard relations if we simply bound the arity of the side signature; and that the complexity drops to NP if we fix the side signature and bound the width of the dependencies.

LOFeb 17, 2022

Query Answering with Transitive and Linear-Ordered Data

Antoine Amarilli, Michael Benedikt, Pierre Bourhis et al.

We consider entailment problems involving powerful constraint languages such as frontier-guarded existential rules in which we impose additional semantic restrictions on a set of distinguished relations. We consider restricting a relation to be transitive, restricting a relation to be the transitive closure of another relation, and restricting a relation to be a linear order. We give some natural variants of guardedness that allow inference to be decidable in each case, and isolate the complexity of the corresponding decision problems. Finally we show that slight changes in these conditions lead to undecidability.

DBMar 5, 2020

Constant-Delay Enumeration for Nondeterministic Document Spanners

Antoine Amarilli, Pierre Bourhis, Stefan Mengel et al.

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.

AIJun 1, 2019

Smoothing Structured Decomposable Circuits

Andy Shih, Guy Van den Broeck, Paul Beame et al.

We study the task of smoothing a circuit, i.e., ensuring that all children of a plus-gate mention the same variables. Circuits serve as the building blocks of state-of-the-art inference algorithms on discrete probabilistic graphical models and probabilistic programs. They are also important for discrete density estimation algorithms. Many of these tasks require the input circuit to be smooth. However, smoothing has not been studied in its own right yet, and only a trivial quadratic algorithm is known. This paper studies efficient smoothing for structured decomposable circuits. We propose a near-linear time algorithm for this task and explore lower bounds for smoothing decomposable circuits, using existing results on range-sum queries. Further, for the important case of All-Marginals, we show a more efficient linear-time algorithm. We validate experimentally the performance of our methods.

DBJul 24, 2018

Constant-Delay Enumeration for Nondeterministic Document Spanners

Antoine Amarilli, Pierre Bourhis, Stefan Mengel et al.

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.

DBMay 4, 2015

Harvesting Entities from the Web Using Unique Identifiers -- IBEX

Aliaksandr Talaika, Joanna Biega, Antoine Amarilli et al.

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.

DBDec 11, 2013

On the Complexity of Mining Itemsets from the Crowd Using Taxonomies

Antoine Amarilli, Yael Amsterdamer, Tova Milo

We study the problem of frequent itemset mining in domains where data is not recorded in a conventional database but only exists in human knowledge. We provide examples of such scenarios, and present a crowdsourcing model for them. The model uses the crowd as an oracle to find out whether an itemset is frequent or not, and relies on a known taxonomy of the item domain to guide the search for frequent itemsets. In the spirit of data mining with oracles, we analyze the complexity of this problem in terms of (i) crowd complexity, that measures the number of crowd questions required to identify the frequent itemsets; and (ii) computational complexity, that measures the computational effort required to choose the questions. We provide lower and upper complexity bounds in terms of the size and structure of the input taxonomy, as well as the size of a concise description of the output itemsets. We also provide constructive algorithms that achieve the upper bounds, and consider more efficient variants for practical situations.

Antoine Amarilli

10 Papers