DB, IR · Apr 20, 2018

Benchmarking Top-K Keyword and Top-K Document Processing with T${}^2$K${}^2$ and T${}^2$K${}^2$D${}^2$

arXiv:1804.07525v1
Originality: Synthesis-oriented
AI Analysis

The paper gives researchers and practitioners in text analysis a benchmarking tool for comparing weighting schemes and database systems. The contribution is incremental: it fills a specific gap without introducing new methods.

The paper addresses the lack of benchmarks for top-k keyword and top-k document extraction by introducing T${}^2$K${}^2$ and T${}^2$K${}^2$D${}^2$. Both use a real tweet dataset to test weighting schemes and database implementations, and the authors report performance tests for the TF-IDF and Okapi BM25 weighting schemes on Oracle, PostgreSQL, and MongoDB.

Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on-the-fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T${}^2$K${}^2$, a top-k keywords and documents benchmark, and its decision support-oriented evolution T${}^2$K${}^2$D${}^2$. Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations, on the other hand.
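To make the two query types concrete, here is a minimal in-memory sketch of top-k keyword and top-k document extraction over a weighted vocabulary. It is an illustration under stated assumptions, not the benchmark's implementation: the abstract does not give the formulas, so the code uses common textbook variants of TF-IDF and Okapi BM25 (with the usual defaults `k1=1.2`, `b=0.75`), and the function names `build_weights`, `top_k_keywords`, and `top_k_documents` as well as the toy documents are hypothetical. The benchmark itself runs such queries inside database systems rather than in Python.

```python
import math
from collections import Counter


def build_weights(docs, scheme="tfidf", k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document.

    Assumption: textbook TF-IDF and BM25 variants, which may differ
    from the exact formulas used in the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        w = {}
        for term, f in Counter(d).items():
            if scheme == "tfidf":
                # Relative term frequency times log inverse document frequency.
                w[term] = (f / len(d)) * math.log(n_docs / df[term])
            elif scheme == "bm25":
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                norm = f + k1 * (1 - b + b * len(d) / avg_len)
                w[term] = idf * f * (k1 + 1) / norm
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        weights.append(w)
    return weights


def top_k_keywords(weights, k):
    """Rank terms by their summed weight over all documents."""
    totals = Counter()
    for w in weights:
        totals.update(w)  # adds the float weights term by term
    return [term for term, _ in totals.most_common(k)]


def top_k_documents(weights, query_terms, k):
    """Rank document indices by the summed weight of the query terms."""
    scored = []
    for i, w in enumerate(weights):
        score = sum(w.get(t, 0.0) for t in query_terms)
        if score > 0:  # skip documents that match no query term
            scored.append((score, i))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Toy stand-in for a tweet corpus (hypothetical data).
docs = [
    ["storm", "hits", "coast"],
    ["storm", "warning", "storm"],
    ["coffee", "morning"],
]

for scheme in ("tfidf", "bm25"):
    weights = build_weights(docs, scheme)
    print(scheme, top_k_keywords(weights, 2), top_k_documents(weights, ["storm"], 2))
```

The split mirrors the cost structure described in the abstract: `build_weights` is the expensive step that constructs the weighted vocabulary, while the two top-k functions are cheap aggregations that can be evaluated on the fly for each query.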
