Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus
This tool addresses researchers' need to produce accurate related work sections efficiently. Its contribution is incremental, since it builds on existing RAG and embedding methods.
The authors address two obstacles to using LLMs in research writing: hallucinated sources and the lack of direct access to scientific articles. Their answer is Citegeist, a pipeline that applies dynamic RAG to the arXiv corpus to generate citation-backed related work sections. It combines embedding-based similarity matching, summarization, and multi-stage filtering with an optimized procedure for incorporating new and modified papers.
Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their use in the research community is inhibited by their tendency to hallucinate invalid sources and by their lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist, an application pipeline that uses dynamic Retrieval-Augmented Generation (RAG) on the arXiv corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy use in the scientific community, we release both a website (https://citegeist.org) and an implementation harness that works with several different LLM implementations.
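To make the retrieval step concrete, here is a minimal sketch of embedding-based similarity matching followed by two filtering stages. It is not the Citegeist implementation. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `retrieve`, `breadth`, and `min_score`, along with the three-paper corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, breadth=3, min_score=0.1):
    """Rank abstracts by similarity to the query, then filter in two stages."""
    q = embed(query)
    scored = [(cosine(q, embed(abstract)), pid) for pid, abstract in corpus.items()]
    # Stage 1 (hypothetical): keep only the `breadth` highest-scoring candidates.
    top = sorted(scored, reverse=True)[:breadth]
    # Stage 2 (hypothetical): drop candidates below a similarity threshold.
    return [pid for score, pid in top if score >= min_score]


# Hypothetical miniature corpus mapping paper ids to abstracts.
corpus = {
    "paper-a": "retrieval augmented generation for citation grounded text",
    "paper-b": "protein folding with graph neural networks",
    "paper-c": "embedding similarity search over scientific abstracts",
}

selected = retrieve(
    "generation of citation backed related work using retrieval",
    corpus,
    breadth=2,
)
print(selected)
```

In a full pipeline of the kind the abstract describes, the selected papers would then be summarized and passed to an LLM together with their identifiers, so that every citation in the generated text refers to a retrieved document rather than a hallucinated one.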