Gonzalo Navarro

DS
10 papers
155 citations
Novelty 51%
AI Score 41

10 Papers

86.0 FL Apr 20
Smallest Suffixient Sets: Effectiveness, Resilience, and Calculation

Hiroto Fujimaru, Gonzalo Navarro, Giuseppe Romana et al.

A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, provided with a random access mechanism, supports various forms of pattern matching. In this paper, we study the size $χ$ of the smallest suffixient set as a repetitiveness measure. First, we study its sensitivity to various string operations. We show that $χ$ cannot increase by more than 2 after appending or prepending a character to the string. As a consequence, we are able to give simple linear-time online algorithms to compute smallest suffixient sets. We also show that, although reversing the string can increase $χ$ by an arbitrary $O(n)$ value, it always holds that $χ(T)/χ(T^R)\le 2$. We also prove lower and upper bounds for the additive or multiplicative increase of $χ$ after applying arbitrary edit operations, or rotating the text. In particular, we show that the additive increase can be as large as $Ω(\sqrt{n})$ for all those operations. Second, we place $χ$ in between known repetitiveness measures. In particular, we show $χ = O(r)$ (where $r$ is the number of runs in the Burrows-Wheeler Transform of the string), that there are string families where $χ = o(v)$ (where $v$ is the size of the smallest lexicographic parse of the string), and that $χ$ is incomparable to almost all reachable measures based on copy-paste mechanisms. In passing, we give precise bounds for $χ$ for some relevant string families, for example $χ\le σ+2$ on episturmian words over alphabets of size $σ$ (e.g., $χ\le 4$ on Fibonacci strings, for which we precisely characterize the only two smallest suffixient sets).
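The measure $r$ that bounds $χ$ above can be made concrete with a small sketch. The code below is not from the paper: it is a naive, quadratic-time illustration that computes the Burrows-Wheeler Transform by sorting rotations and counts its runs. The function names, the `$` sentinel convention, and the Fibonacci-string indexing ($F_0 = b$, $F_1 = a$) are assumptions made for this example.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations (illustration only).

    Assumes `text` ends with a unique sentinel '$' that is smaller than
    every other symbol, so that all rotations are distinct.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def bwt_runs(text: str) -> int:
    """Number r of maximal runs of equal symbols in the BWT of `text`."""
    return sum(1 for _ in groupby(bwt(text)))


def fibonacci_string(k: int) -> str:
    """k-th Fibonacci string with F_0 = 'b', F_1 = 'a', F_k = F_{k-1} F_{k-2}."""
    previous, current = "b", "a"
    for _ in range(k - 1):
        previous, current = current, current + previous
    return current if k >= 1 else previous


if __name__ == "__main__":
    print(bwt("banana$"), bwt_runs("banana$"))
    for k in range(2, 10):
        word = fibonacci_string(k) + "$"
        print(k, len(word), bwt_runs(word))
```

For `banana$` the transform is `annb$aa`, which has 5 runs. The loop prints $r$ for growing Fibonacci strings so that it can be compared against the string length; it says nothing about $χ$ itself, which would need the definition of suffixient sets from the paper.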

DS Nov 13, 2020
A grammar compressor for collections of reads with applications to the construction of the BWT

Diego Díaz-Domínguez, Gonzalo Navarro

We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform in succinct space genomic analyses that require complex string queries not yet supported by repetition-based self-indexes. Our approach is to store the set of reads as a grammar, but when required, compute its BWT to carry out the analysis by using self-indexes. Our experiments on real data showed that the space reduction we achieve with our compressor is competitive with LZ-based methods and better than entropy-based approaches. Compared to other popular grammars, on this kind of data, we achieve, on average, 12% extra compression and require less working space and time.

DS Feb 20, 2019
Fast, Small, and Simple Document Listing on Repetitive Text Collections

Dustin Cobas, Gonzalo Navarro

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. In contrast, there are few analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length $n$ that lists the $ndoc$ distinct documents where a pattern of length $m$ appears in time $\mathcal{O}(m+ndoc \cdot \log n)$. We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
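The document array mentioned above can be illustrated with a short uncompressed sketch. This is not the paper's index: it builds a plain suffix array by naive sorting, coarsens it to document identifiers, and answers a listing query by scanning the matching interval. The function names and the `\x00` separator (assumed absent from all documents and patterns) are choices made for this example.

```python
def build_document_array(docs):
    """Return (text, suffix_array, document_array) for a list of strings."""
    text = ""
    doc_of_position = []
    for doc_id, doc in enumerate(docs):
        text += doc + "\x00"  # separator keeps matches inside one document
        doc_of_position.extend([doc_id] * (len(doc) + 1))
    # Naive suffix sorting; fine for a toy example, far too slow for real data.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The document array stores, for each suffix, the document it starts in.
    document_array = [doc_of_position[i] for i in suffix_array]
    return text, suffix_array, document_array


def suffix_range(text, suffix_array, pattern):
    """Half-open interval [start, end) of suffixes that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def list_documents(docs, pattern):
    """Distinct identifiers of the documents that contain `pattern`."""
    text, suffix_array, document_array = build_document_array(docs)
    start, end = suffix_range(text, suffix_array, pattern)
    return sorted(set(document_array[start:end]))


if __name__ == "__main__":
    collection = ["abracadabra", "cadabra", "banana"]
    print(list_documents(collection, "abra"))
```

The scan over `document_array[start:end]` costs time proportional to the number of occurrences rather than to $ndoc$; avoiding that cost, and storing the array in compressed form, is what an actual index has to solve.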

IR May 30, 2016
Document Retrieval on Repetitive String Collections

Travis Gagie, Aleksi Hartikainen, Kalle Karhu et al.

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-$k$ document retrieval (find the $k$ documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
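The three queries defined above can be pinned down with a brute-force reference. The sketch below only states their semantics by scanning every document; it uses none of the paper's index structures. The function names and the tie-breaking rule for top-$k$ (smaller document identifier first) are assumptions made for this example.

```python
import heapq
from collections import Counter


def term_frequencies(docs, pattern):
    """Map document id -> number of (possibly overlapping) occurrences."""
    frequencies = Counter()
    for doc_id, doc in enumerate(docs):
        count = sum(
            1
            for i in range(len(doc) - len(pattern) + 1)
            if doc.startswith(pattern, i)
        )
        if count:
            frequencies[doc_id] = count
    return frequencies


def document_listing(docs, pattern):
    """All documents where the pattern appears."""
    return sorted(term_frequencies(docs, pattern))


def document_counting(docs, pattern):
    """Number of documents where the pattern appears."""
    return len(term_frequencies(docs, pattern))


def top_k(docs, pattern, k):
    """The k documents where the pattern appears most often."""
    frequencies = term_frequencies(docs, pattern)
    # Ties are broken toward the smaller document id (an arbitrary choice).
    return heapq.nlargest(k, frequencies, key=lambda d: (frequencies[d], -d))


if __name__ == "__main__":
    collection = ["abababa", "xxabxx", "nothing", "abab"]
    print(document_listing(collection, "ab"))
    print(document_counting(collection, "ab"))
    print(top_k(collection, "ab", 2))
```

A reference like this is mainly useful as a correctness oracle when testing a compressed index on small inputs.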

IR Apr 29, 2016
Universal Indexes for Highly Repetitive Document Collections

Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto et al.

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also include compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
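Why near-copies make differential lists compressible can be seen on a single posting list. The sketch below is an illustration under simplifying assumptions, not the paper's encoder: a posting list is taken to be a strictly increasing list of document identifiers, the differential list stores the first identifier followed by successive gaps, and run-length compression is applied to that list. All function names are hypothetical.

```python
from itertools import accumulate, groupby


def differential_list(postings):
    """First identifier followed by the gaps between consecutive identifiers."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def run_length_encode(values):
    """Collapse maximal runs of equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def decode(runs):
    """Invert run_length_encode followed by differential_list."""
    gaps = [value for value, length in runs for _ in range(length)]
    return list(accumulate(gaps))


if __name__ == "__main__":
    # A word that survives across consecutive versions occupies consecutive
    # document identifiers, so its differential list is mostly gaps of 1.
    postings = [3, 4, 5, 6, 7, 8, 20, 21, 22]
    gaps = differential_list(postings)
    runs = run_length_encode(gaps)
    print(gaps)
    print(runs)
    print(decode(runs) == postings)
```

Here nine postings become four runs. Encoding each gap separately with a variable-length code would still spend at least one codeword per posting, which is the cost run-based schemes avoid on near-copy collections.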

DS Apr 19, 2014
Document Retrieval on Repetitive Collections

Gonzalo Navarro, Simon J. Puglisi, Jouni Sirén

Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections.

DB Oct 18, 2013
Compressed Vertical Partitioning for Full-In-Memory RDF Management

Sandra Álvarez-García, Nieves R. Brisaboa, Javier D. Fernández et al.

The Web of Data has been gaining momentum, leading to the publication of ever more semi-structured datasets following the RDF model, based on atomic triple units of subject, predicate, and object. Although it is a simple model, compression methods become necessary because datasets are increasingly large and various scalability issues arise around their organization and storage. This requirement is more restrictive in RDF stores because efficient SPARQL resolution on the compressed RDF datasets is also required. This article introduces a novel RDF indexing technique (called k2-triples) supporting efficient SPARQL resolution in compressed space. k2-triples uses the predicate to vertically partition the dataset into disjoint subsets of pairs (subject, object), one per predicate. These subsets are represented as binary matrices in which 1-bits mean that the corresponding triple exists in the dataset. This model results in very sparse matrices, which are efficiently compressed using k2-trees. We enhance this model with two compact indexes listing the predicates related to each different subject and object, in order to address the specific weaknesses of vertically partitioned representations. The resulting technique not only achieves by far the most compressed representations, but also the best overall performance for RDF retrieval in our experiments. Our approach uses up to 10 times less space than a state-of-the-art baseline, and outperforms it by several orders of magnitude on the most basic query patterns. In addition, we optimize traditional join algorithms on k2-triples and define a novel one leveraging its specific features. Our experimental results show that our technique outperforms traditional vertical partitioning for join resolution, reporting the best numbers for joins in which the non-joined nodes are provided, and being competitive in the majority of the cases.
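The vertical partitioning step can be sketched in a few lines. The code below is an uncompressed illustration, not k2-triples itself: each predicate's binary matrix is represented as a Python set of (subject, object) pairs standing in for a k2-tree, and subjects, predicates, and objects are assumed to be already mapped to integer identifiers. The function names are hypothetical.

```python
from collections import defaultdict


def vertical_partition(triples):
    """Group (subject, predicate, object) triples by predicate.

    Each value is the set of 1-cells of that predicate's sparse
    subject-by-object binary matrix.
    """
    partitions = defaultdict(set)
    for subject, predicate, obj in triples:
        partitions[predicate].add((subject, obj))
    return partitions


def match(partitions, s=None, p=None, o=None):
    """Resolve a triple pattern; None marks an unbound variable."""
    # With an unbound predicate every partition must be visited, which is
    # the weakness of vertical partitioning that extra indexes address.
    predicates = [p] if p is not None else list(partitions)
    results = []
    for predicate in predicates:
        for subject, obj in partitions.get(predicate, ()):
            if (s is None or subject == s) and (o is None or obj == o):
                results.append((subject, predicate, obj))
    return sorted(results)


if __name__ == "__main__":
    data = [(1, 10, 2), (1, 11, 3), (2, 10, 3), (3, 10, 1)]
    partitions = vertical_partition(data)
    print(match(partitions, p=10))
    print(match(partitions, s=1))
```

In this sketch a bound-predicate pattern still scans the whole partition; a k2-tree would instead restrict the search to one row, one column, or one cell of the matrix.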

DS Jul 25, 2013
Optimal Top-k Document Retrieval

Gonzalo Navarro, Yakov Nekrich

Let $\mathcal{D}$ be a collection of $D$ documents, which are strings over an alphabet of size $σ$, of total length $n$. We describe a data structure that uses linear space and reports the $k$ most relevant documents that contain a query pattern $P$, which is a string of length $p$, in time $O(p/\log_σ n+k)$, which is optimal in the RAM model in the general case where $\lg D = Θ(\log n)$, and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures... [clip] When $\lg D = o(\log n)$, we show how to reduce the space of the data structure from $O(n\log n)$ to $O(n(\log σ+\log D+\log\log n))$ bits... [clip] We also consider the dynamic scenario, where documents can be inserted and deleted from the collection. We obtain linear space and query time $O(p(\log\log n)^2/\log_σ n+\log n + k\log\log k)$, whereas insertions and deletions require $O(\log^{1+ε} n)$ time per symbol, for any constant $ε>0$. Finally, we consider an extended static scenario where an extra parameter $par(P,d)$ is defined, and the query must retrieve only documents $d$ such that $par(P,d)\in [τ_1,τ_2]$, where this range is specified at query time. We solve these queries using linear space and $O(p/\log_σ n + \log^{1+ε} n + k\log^ε n)$ time, for any constant $ε>0$. Our technique is to translate these top-$k$ problems into multidimensional geometric search problems. As an additional bonus, we describe some improvements to those problems.

IR Apr 22, 2013
Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences

Gonzalo Navarro

Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to "natural language" text collections, where inverted indices are the preferred solution. As successful as this paradigm has been, it fails to properly handle some East Asian languages and other scenarios where the "natural language" assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many others. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and others.

IR Jul 23, 2012
Ranked Document Retrieval in (Almost) No Space

Nieves R. Brisaboa, Ana Cerdeira-Pena, Gonzalo Navarro et al.

Ranked document retrieval is a fundamental task in search engines. Such queries are solved with inverted indexes that require an additional 45%-80% of the compressed text space, and take tens to hundreds of microseconds per query. In this paper we show how ranked document retrieval queries can be solved within tens of milliseconds using essentially no extra space over an in-memory compressed representation of the document collection. More precisely, we enhance wavelet trees on bytecodes (WTBCs), a data structure that rearranges the bytes of the compressed collection, so that they support ranked conjunctive and disjunctive queries, using just 6%-18% of the compressed text space.