LG PF · Apr 23

Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Ashley N. Abraham, Andrew Strelzoff, Haley R. Dozier, Althea C. Henslee, Mark A. Chappell

arXiv:2604.21645 · 0.7

Predicted impact top 98% in LG · last 90 days · Originality Synthesis-oriented

AI Analysis

For practitioners needing efficient ANN search on large-scale data, this work provides a scalable Python-based solution that lowers computational barriers.

The paper tackles the computational expense of large-scale Approximate Nearest Neighbor search by parallelizing Product Quantization and Inverted Indexing with Dask, reducing memory and time requirements to the level of medium-scale data processing without compromising accuracy.

Large-scale Nearest Neighbor (NN) search, though widely used in similarity search, remains limited by the computational cost of processing large-scale data. To reduce that cost, Approximate Nearest Neighbor (ANN) search is often used in applications that do not require exact similarity search and can instead rely on an approximation. Product Quantization (PQ) is a memory-efficient ANN technique that is effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data is computationally expensive in both memory and execution time. This work presents a divide-and-conquer approach to large-scale data in Python using PQ, Inverted Indexing, and Dask: the partial results are combined without compromising accuracy, and the computational requirements are reduced to the level needed for medium-scale data.
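
To make the divide-and-conquer idea concrete, here is a minimal NumPy sketch of PQ encoding with an inverted index built chunk by chunk and then merged. It is not the authors' pipeline: the abstract does not describe their implementation, so every function name (`train`, `encode_chunk`, `merge`, `search`), the toy k-means, and all parameter values are assumptions made for illustration. Dask itself is deliberately left out so the block runs with NumPy alone; the list comprehension over chunks marks the spot where a scheduler could distribute the per-chunk work, for example by wrapping each call in `dask.delayed`.

```python
import numpy as np

def sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and b (k, d)."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def kmeans(x, k, iters=10, seed=0):
    """Toy Lloyd's k-means (illustrative only); returns k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = sq_dists(x, centroids).argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def train(sample, m, ks, n_cells):
    """Learn coarse cells (inverted index) and m PQ codebooks from a sample."""
    coarse = kmeans(sample, n_cells)
    codebooks = [kmeans(s, ks) for s in np.split(sample, m, axis=1)]
    return coarse, codebooks

def encode_chunk(chunk, offset, coarse, codebooks):
    """Per-chunk work: PQ codes plus a partial inverted index with global ids."""
    subs = np.split(chunk, len(codebooks), axis=1)
    codes = np.stack(
        [sq_dists(s, cb).argmin(axis=1) for s, cb in zip(subs, codebooks)], axis=1
    ).astype(np.uint8)
    cells = sq_dists(chunk, coarse).argmin(axis=1)
    partial = {}
    for local_id, cell in enumerate(cells):
        partial.setdefault(int(cell), []).append(offset + local_id)
    return codes, partial

def merge(results):
    """Combine per-chunk codes and partial inverted lists into one index."""
    codes = np.concatenate([c for c, _ in results])
    index = {}
    for _, partial in results:
        for cell, ids in partial.items():
            index.setdefault(cell, []).extend(ids)
    return codes, index

def search(q, coarse, codebooks, codes, index, k=5, nprobe=2):
    """Probe the nprobe nearest cells, rank candidates by PQ table lookups."""
    probe = sq_dists(q[None, :], coarse)[0].argsort()[:nprobe]
    cand = np.array([i for c in probe for i in index.get(int(c), [])], dtype=np.int64)
    q_subs = np.split(q, len(codebooks))
    # table[j, c] = squared distance from query sub-vector j to centroid c
    table = np.stack([sq_dists(qs[None, :], cb)[0] for qs, cb in zip(q_subs, codebooks)])
    dists = table[np.arange(len(codebooks)), codes[cand]].sum(axis=1)
    order = dists.argsort()[:k]
    return cand[order], dists[order]

rng = np.random.default_rng(42)
x = rng.standard_normal((400, 8))          # synthetic data, not from the paper
coarse, codebooks = train(x[:200], m=4, ks=16, n_cells=4)

chunks = np.array_split(x, 4)
offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])
# Each call is independent, so this loop is what a scheduler could parallelize.
results = [encode_chunk(c, int(o), coarse, codebooks) for c, o in zip(chunks, offsets)]
codes, index = merge(results)

ids, dists = search(x[0], coarse, codebooks, codes, index)
```

Because the codebooks and coarse centroids are shared across chunks, encoding a chunk depends only on its own rows, which is why the merged result matches what a single pass over the full array would produce. Each vector is stored as `m` one-byte codes instead of `d` floats, which is where the memory saving of PQ comes from.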
