Nima Asadi

2 Papers

IR · May 3, 2013
Fast, Incremental Inverted Indexing in Main Memory for Web-Scale Collections

Nima Asadi, Jimmy Lin

For text retrieval systems, the assumption that all data structures reside in main memory is increasingly common. In this context, we present a novel incremental inverted indexing algorithm for web-scale collections that directly constructs compressed postings lists in memory. Designing efficient in-memory algorithms requires understanding modern processor architectures and memory hierarchies: in this paper, we explore the issue of postings list contiguity. Naturally, postings lists that occupy contiguous memory regions are preferred for retrieval, but maintaining contiguity increases complexity and slows indexing. On the other hand, allowing discontiguous index segments simplifies index construction but decreases retrieval performance. Understanding this tradeoff is our main contribution: we find that co-locating small groups of inverted list segments yields query evaluation performance that is statistically indistinguishable from that of fully contiguous postings lists. In other words, it is not necessary to lay out in-memory data structures such that all postings for a term are contiguous; we can achieve ideal performance with a relatively small amount of effort.
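
The contiguity tradeoff described in this abstract can be illustrated with a small sketch. This is not the paper's algorithm: it omits compression entirely, and `SegmentedIndex`, `SEGMENT_SIZE`, `group_size`, and `jumps` are hypothetical names chosen for illustration. It assumes postings are written into fixed-size segments carved from a single growing arena, and that each term reserves `group_size` adjacent segments at a time, so a larger group means fewer discontiguous jumps when a term's postings are read back.

```python
from collections import defaultdict
from typing import Dict, List

SEGMENT_SIZE = 4  # postings per segment; an arbitrary illustrative value


class SegmentedIndex:
    """Toy in-memory index that reserves segments in contiguous groups."""

    def __init__(self, group_size: int) -> None:
        self.group_size = group_size
        self.arena: List[int] = []  # single shared memory region
        # term -> start offsets of its segments, in write order
        self.segments: Dict[str, List[int]] = defaultdict(list)
        # term -> number of postings written so far
        self.counts: Dict[str, int] = defaultdict(int)

    def add(self, term: str, doc_id: int) -> None:
        n = self.counts[term]
        seg_index = n // SEGMENT_SIZE
        if n % SEGMENT_SIZE == 0 and seg_index % self.group_size == 0:
            # All reserved segments are full: reserve a new contiguous group.
            start = len(self.arena)
            self.arena.extend([0] * (SEGMENT_SIZE * self.group_size))
            self.segments[term].extend(
                start + i * SEGMENT_SIZE for i in range(self.group_size)
            )
        offset = self.segments[term][seg_index] + n % SEGMENT_SIZE
        self.arena[offset] = doc_id
        self.counts[term] = n + 1

    def postings(self, term: str) -> List[int]:
        segs = self.segments.get(term, [])
        n = self.counts.get(term, 0)
        return [self.arena[segs[i // SEGMENT_SIZE] + i % SEGMENT_SIZE]
                for i in range(n)]

    def jumps(self, term: str) -> int:
        """Count non-adjacent transitions between a term's used segments."""
        used = -(-self.counts.get(term, 0) // SEGMENT_SIZE)  # ceiling division
        segs = self.segments.get(term, [])[:used]
        return sum(1 for a, b in zip(segs, segs[1:]) if b != a + SEGMENT_SIZE)
```

When postings for two terms arrive interleaved, a group size of 1 scatters each term's segments across the arena, while a larger group keeps them adjacent at the cost of reserving memory before it is needed. The sketch only counts jumps; it says nothing about how many jumps are tolerable, which is the empirical question the paper addresses.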

IR · Feb 21, 2013
Dynamic Memory Allocation Policies for Postings in Real-Time Twitter Search

Nima Asadi, Jimmy Lin, Michael Busch

We explore a real-time Twitter search application where tweets are arriving at a rate of several thousand per second. Real-time search demands that they be indexed and searchable immediately, which leads to a number of implementation challenges. In this paper, we focus on one aspect: dynamic postings allocation policies for index structures that are completely held in main memory. The core issue can be characterized as a "Goldilocks Problem". Because memory remains a scarce resource today, an allocation policy that is too aggressive leads to inefficient utilization, while a policy that is too conservative is slow and leads to fragmented postings lists. We present a dynamic postings allocation policy that allocates memory in increasingly larger "slices" from a small number of large, fixed pools of memory. Through analytical models and experiments, we explore different settings that balance time (query evaluation speed) and space (memory utilization).
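
The slice-based policy described in this abstract might look roughly like the following sketch. The slice sizes, the pool capacity, and the names `SlicePools`, `SLICE_SIZES`, and `utilization` are assumptions made for illustration, not settings or identifiers from the paper. The sketch assumes one fixed pool per slice size and assumes that a term whose current slice fills up receives its next slice from the next larger size class.

```python
from typing import Dict, List

# Hypothetical slice sizes (in postings), one pool per size.
# These are illustrative values, not the settings studied in the paper.
SLICE_SIZES = [2, 16, 128, 2048]


class SlicePools:
    """Toy allocator: one fixed-capacity pool per slice size."""

    def __init__(self, pool_capacity: int = 1 << 12) -> None:
        self.pool_capacity = pool_capacity
        self.data = [[0] * pool_capacity for _ in SLICE_SIZES]
        self.next_free = [0] * len(SLICE_SIZES)  # next unused offset per pool
        # term -> chain of [pool level, start offset, postings written]
        self.slices: Dict[str, List[List[int]]] = {}

    def _allocate(self, level: int) -> int:
        size = SLICE_SIZES[level]
        start = self.next_free[level]
        if start + size > self.pool_capacity:
            raise MemoryError(f"pool {level} exhausted")
        self.next_free[level] = start + size
        return start

    def add(self, term: str, doc_id: int) -> None:
        chain = self.slices.setdefault(term, [])
        if not chain or chain[-1][2] == SLICE_SIZES[chain[-1][0]]:
            # No slice yet, or the current one is full: take a slice from
            # the next larger size class, capped at the largest.
            level = min(chain[-1][0] + 1, len(SLICE_SIZES) - 1) if chain else 0
            chain.append([level, self._allocate(level), 0])
        level, start, used = chain[-1]
        self.data[level][start + used] = doc_id
        chain[-1][2] = used + 1

    def postings(self, term: str) -> List[int]:
        out: List[int] = []
        for level, start, used in self.slices.get(term, []):
            out.extend(self.data[level][start:start + used])
        return out

    def utilization(self) -> float:
        """Fraction of allocated slice space that actually holds postings."""
        allocated = sum(SLICE_SIZES[s[0]] for c in self.slices.values() for s in c)
        used = sum(s[2] for c in self.slices.values() for s in c)
        return used / allocated if allocated else 1.0
```

Under these assumptions, larger early slices mean fewer slices per term and less fragmentation, but more allocated space sits unused for rare terms. Smaller slices reverse that balance. Changing `SLICE_SIZES` and watching `utilization()` alongside the chain length per term gives a rough feel for the time and space tradeoff the abstract describes.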