The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
This work identifies a critical limitation for practitioners using dense low-dimensional information retrieval, particularly when scaling to very large datasets, by showing that their performance degrades faster than sparse methods.
This paper investigates the performance of dense low-dimensional information retrieval for large index sizes, finding that its performance degrades more rapidly than sparse representations. This can lead to a tipping point where sparse methods outperform dense ones due to an increased rate of false positives with lower dimensionality.
Information Retrieval using dense low-dimensional representations recently became popular and showed out-performance to traditional sparse-representations like BM25. However, no previous work investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance for dense representations decreases quicker than sparse representations for increasing index sizes. In extreme cases, this can even lead to a tipping point where at a certain index size sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: The lower the dimension, the higher the chance for false positives, i.e. returning irrelevant documents.