IR CL | Mar 11, 2022

Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan

CMU

arXiv:2203.05765v1 | 25.489 citations | h-index: 79 | Has Code

Originality: Synthesis-oriented

AI Analysis

This toolkit addresses the problem of fragmented and inefficient software stacks for researchers in dense retrieval, offering a flexible foundation for system research.

The authors tackled the lack of efficient and flexible software for dense retrieval by introducing Tevatron, a toolkit that provides a standardized pipeline and demonstrates effectiveness and efficiency across multiple IR and QA datasets.

Recent rapid advancements in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based dense retrieval. While several good research papers have emerged, many of them come with their own software stacks. These stacks are typically optimized for some particular research goals instead of efficiency or code structure. In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity. Tevatron provides a standardized pipeline for dense retrieval including text processing, model training, corpus/query encoding, and search. This paper presents an overview of Tevatron and demonstrates its effectiveness and efficiency across several IR and QA datasets. We also show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms (GPU/TPU). We believe Tevatron can serve as an effective software foundation for dense retrieval system research including design, modeling, and optimization.
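
To make the pipeline stages named in the abstract concrete, here is a minimal sketch of text processing, corpus/query encoding, and search in plain NumPy. It is not Tevatron's API: the names `tokenize`, `encode`, and `search` are hypothetical, the encoder is a toy hashed bag-of-words standing in for a trained language model, and the model training stage is omitted entirely.

```python
import zlib

import numpy as np

DIM = 64  # embedding size; an arbitrary choice for this illustration


def tokenize(text):
    """Text processing: lowercase and split on whitespace."""
    return text.lower().split()


def encode(texts, dim=DIM):
    """Toy stand-in for a trained encoder (assumption, not a real model).

    Each token is hashed into one of `dim` buckets, and the resulting
    count vector is L2-normalized so that inner product equals cosine.
    """
    out = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        for tok in tokenize(text):
            out[i, zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)


def search(query_vecs, corpus_vecs, k):
    """Exhaustive inner-product search; returns top-k indices and scores."""
    scores = query_vecs @ corpus_vecs.T
    top = np.argsort(-scores, axis=1)[:, :k]
    return top, np.take_along_axis(scores, top, axis=1)


corpus = [
    "dense retrieval with learned embeddings",
    "sparse retrieval with inverted index",
    "how to bake bread at home",
]
corpus_vecs = encode(corpus)                         # corpus encoding
query_vecs = encode(["how to bake bread at home"])   # query encoding
top, scores = search(query_vecs, corpus_vecs, k=2)   # search
print(corpus[top[0, 0]], float(scores[0, 0]))
```

In a real system the encoder would be a fine-tuned transformer and the exhaustive matrix product would typically be replaced by a dedicated vector index, but the separation into encode and search steps stays the same.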
