Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval
This work addresses retrieval challenges in the Islamic domain, where in-domain data is scarce in languages such as English. Its contribution is incremental, however, because it builds on existing models and methods.
This study tackles neural passage retrieval in the Islamic domain by developing a bilingual large language model with a multi-stage training process. The resulting model outperforms monolingual models on downstream retrieval tasks.
This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. Starting from the robust XLM-R model, we apply a language reduction technique to create a lightweight bilingual large language model (LLM). Our domain adaptation approach addresses a challenge specific to the Islamic domain: substantial in-domain corpora exist only in Arabic, while corpora in other languages, including English, are limited. We train the retrieval models in multiple stages, combining large general-purpose retrieval datasets such as MS MARCO with smaller in-domain datasets to improve retrieval performance. Additionally, we curate an in-domain retrieval dataset in English by applying data augmentation techniques to material from a reliable Islamic source. This enlarges the domain-specific retrieval data and leads to further performance gains. The findings suggest that combining domain adaptation with multi-stage training enables the bilingual Islamic neural retrieval model to outperform monolingual models on downstream retrieval tasks.
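The abstract does not spell out how the language reduction step works. One common way to shrink a multilingual model to two languages is vocabulary trimming: keep only the tokens that occur in the target-language corpora (plus the special tokens) and copy the matching embedding rows. The sketch below illustrates that idea on toy data in plain Python. The function name `reduce_vocabulary`, the token strings, and the two-dimensional embeddings are all hypothetical, and this should not be read as the authors' actual procedure.

```python
from typing import Dict, List, Sequence, Tuple


def reduce_vocabulary(
    vocab: Dict[str, int],
    embeddings: Sequence[Sequence[float]],
    corpus_tokens: Sequence[str],
    special_tokens: Sequence[str] = ("<s>", "<pad>", "</s>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only special tokens and tokens seen in the target-language corpus.

    Hypothetical helper for illustration; returns a re-indexed vocabulary
    and the embedding rows that correspond to the kept tokens.
    """
    keep = set(special_tokens) | set(corpus_tokens)
    # Preserve the original id order so special tokens stay at the front.
    kept = sorted((tok for tok in vocab if tok in keep), key=vocab.__getitem__)
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [list(embeddings[vocab[tok]]) for tok in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary (assumed values, not taken from XLM-R).
AR_PRAYER = "\u0635\u0644\u0627\u0629"  # Arabic word for "prayer"
vocab = {
    "<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
    "bonjour": 4, "prayer": 5, "hola": 6, AR_PRAYER: 7,
}
# One toy embedding row per token id.
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Tokens observed after tokenizing the English and Arabic corpora.
corpus_tokens = ["prayer", AR_PRAYER, "prayer"]

small_vocab, small_embeddings = reduce_vocabulary(vocab, embeddings, corpus_tokens)
print(len(vocab), "->", len(small_vocab))
```

In this toy run the French and Spanish tokens are dropped and the remaining tokens are renumbered contiguously. Because the embedding matrix accounts for a large share of a multilingual encoder's parameters, trimming it in this way is what would make the resulting bilingual model lighter.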