IR AI CL May 19, 2024

DocReLM: Mastering Document Retrieval with Language Model

Gengchen Wei, Xinle Pang, Tianning Zhang, Yu Sun, Xun Qian, Chen Lin, Han-Sen Zhong, Wanli Ouyang

arXiv:2405.11461v12.24 citations h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the challenge academic researchers face in searching vast document corpora efficiently by improving the retrieval system's semantic understanding. It represents a strong, specific gain rather than a broad breakthrough.

The paper tackles semantic document retrieval for academic research by using large language models to train a retrieval system. It achieves a Top 10 accuracy of 44.12% in computer vision and 36.21% in quantum physics, compared with 15.69% and 12.96% respectively for Google Scholar.
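The Top 10 accuracy figures above can be illustrated with a short sketch. It assumes one common definition of the metric, the fraction of queries for which at least one annotated relevant paper appears among the first k results; the summary does not spell out the exact definition used in the paper. The function name `top_k_accuracy` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one relevant document in the top k.

    Assumption: this "hit at k" definition is illustrative; the paper's
    exact evaluation protocol may differ.
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_ids in rankings.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its first k results is annotated relevant.
        if any(doc_id in gold for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)


# Toy data (hypothetical): four queries, each with a ranked list of paper ids.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["x", "y", "z"],
}
relevant = {"q1": {"b"}, "q2": {"z"}, "q3": {"i"}, "q4": {"x"}}

print(top_k_accuracy(rankings, relevant, k=2))
print(top_k_accuracy(rankings, relevant, k=3))
```

Under this definition, the reported numbers correspond to roughly a 2.8x improvement over Google Scholar in both fields (44.12 / 15.69 and 36.21 / 12.96).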

With over 200 million published academic documents and millions of new documents being written each year, academic researchers face the challenge of searching for information within this vast corpus. However, existing retrieval systems struggle to understand the semantics and domain knowledge present in academic papers. In this work, we demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities, significantly outperforming existing systems. Our approach involves training the retriever and reranker using domain-specific data generated by large language models. Additionally, we utilize large language models to identify candidates from the references of retrieved papers to further improve performance. We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance. The results show that DocReLM achieves a Top 10 accuracy of 44.12% in computer vision, compared to Google Scholar's 15.69%, and 36.21% in quantum physics, compared to Google Scholar's 12.96%.
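The three components named in the abstract (retriever, reranker, and candidate identification from references) can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the stage order, all function names, and the corpus are hypothetical, and simple token-overlap scores stand in for the trained retriever, the trained reranker, and the large language model's relevance judgment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Paper:
    paper_id: str
    text: str
    references: List[str] = field(default_factory=list)


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, corpus: Dict[str, Paper], k: int) -> List[str]:
    """Stage 1 stand-in: rank by token-overlap count (a real system would use a trained retriever)."""
    q = tokens(query)
    scored = sorted(corpus.values(), key=lambda p: (-len(q & tokens(p.text)), p.paper_id))
    return [p.paper_id for p in scored[:k]]


def rerank(query: str, candidate_ids: List[str], corpus: Dict[str, Paper]) -> List[str]:
    """Stage 2 stand-in: reorder candidates by Jaccard similarity (a real system would use a trained reranker)."""
    q = tokens(query)

    def score(pid: str) -> float:
        t = tokens(corpus[pid].text)
        return len(q & t) / len(q | t)

    return sorted(candidate_ids, key=lambda pid: (-score(pid), pid))


def expand_with_references(
    query: str,
    ranked_ids: List[str],
    corpus: Dict[str, Paper],
    is_relevant: Callable[[str, Paper], bool],
) -> List[str]:
    """Stage 3 stand-in: append cited papers that a judge (an LLM in the paper) deems relevant."""
    out = list(ranked_ids)
    for pid in ranked_ids:
        for ref in corpus[pid].references:
            if ref in corpus and ref not in out and is_relevant(query, corpus[ref]):
                out.append(ref)
    return out


def search(query: str, corpus: Dict[str, Paper], k: int) -> List[str]:
    # Hypothetical keyword check standing in for an LLM relevance judgment.
    judge = lambda q, paper: "quantum" in tokens(paper.text)
    candidates = retrieve(query, corpus, k)
    ranked = rerank(query, candidates, corpus)
    return expand_with_references(query, ranked, corpus, judge)


# Toy corpus (hypothetical): p1 cites p3, which stage 1 alone does not return at k=2.
corpus = {
    "p1": Paper("p1", "quantum error correction with surface codes", ["p3"]),
    "p2": Paper("p2", "image segmentation with vision transformers"),
    "p3": Paper("p3", "threshold theorem for fault tolerant quantum computation"),
    "p4": Paper("p4", "quantum error mitigation on noisy devices"),
}

results = search("quantum error correction", corpus, k=2)
print(results)
```

In this toy run the reference step surfaces `p3`, a paper the first-stage retriever ranked too low to return, which is the kind of gain the abstract attributes to mining the references of retrieved papers.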
