CL · Feb 24, 2025

A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts

arXiv:2502.16767v1 · 15 citations · h-index: 2 · COLING Workshops
Originality: Synthesis-oriented
AI Analysis

This work addresses information retrieval for regulatory officers performing compliance tasks. Its contribution is incremental: it builds on existing techniques such as BM25 and LLMs.

The paper tackles the challenge of retrieving information from complex regulatory texts by developing a hybrid system that combines lexical and semantic search with a RAG framework. The authors report significant improvements in Recall@10 and MAP@10 over standalone lexical and semantic methods.
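For readers unfamiliar with the two metrics named above, the following is a minimal Python sketch of how Recall@10 and MAP@10 are commonly computed. The function names are illustrative, and the normalization of average precision by `min(len(relevant), k)` is one common convention assumed here; the paper may use a different one.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average of precision values taken at each rank where a relevant item occurs.

    Assumes ranked_ids contains no duplicates. The denominator
    min(len(relevant), k) is an assumed convention, not taken from the paper.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(len(relevant), k)


def map_at_k(all_ranked, all_relevant, k=10):
    """Mean of average_precision_at_k over a set of queries."""
    aps = [
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(all_ranked, all_relevant)
    ]
    return sum(aps) / len(aps) if aps else 0.0
```

Recall@10 rewards finding the relevant passages at all, while MAP@10 additionally rewards placing them near the top of the ranking.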

Regulatory texts are inherently long and complex, presenting significant challenges for information retrieval systems in supporting regulatory officers with compliance tasks. This paper introduces a hybrid information retrieval system that combines lexical and semantic search techniques to extract relevant information from large regulatory corpora. The system integrates a fine-tuned sentence transformer model with the traditional BM25 algorithm to achieve both semantic precision and lexical coverage. To generate accurate and comprehensive responses, retrieved passages are synthesized using Large Language Models (LLMs) within a Retrieval Augmented Generation (RAG) framework. Experimental results demonstrate that the hybrid system significantly outperforms standalone lexical and semantic approaches, with notable improvements in Recall@10 and MAP@10. By openly sharing our fine-tuned model and methodology, we aim to advance the development of robust natural language processing tools for compliance-driven applications in regulatory domains.
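The combination of lexical and semantic scores described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the min-max normalization, the weighted sum, and the `alpha` weight are assumptions chosen for illustration, and the document vectors stand in for embeddings that a fine-tuned sentence transformer would produce.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query.

    Uses the log(1 + ...) IDF variant, which keeps scores non-negative.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so the two score types are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5, top_k=10):
    """Rank documents by a weighted sum of semantic and lexical scores.

    alpha is a hypothetical weight on the semantic score (assumption).
    Returns the indices of the top_k documents, best first.
    """
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * s + (1 - alpha) * l for s, l in zip(semantic, lexical)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_k]
```

In this sketch, a passage that shares no terms with the query can still rank well if its embedding is close to the query embedding, and an exact term match is not lost when the embedding similarity is weak. The top-ranked passages would then be passed to an LLM as context in the RAG step.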

Code Implementations: 1 repo
