IR · Dec 21, 2020

Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study

arXiv:2012.11405v2 · 23 citations
AI Analysis

This study addresses the reproducibility and effectiveness of a specific pretrained language model framework (BERT-PLI) for legal and patent document retrieval, which is important for researchers and practitioners in these domain-specific search areas. It is an incremental study.

This paper attempts to reproduce the BERT-PLI framework and evaluates it for cross-domain retrieval in the legal and patent domains. The authors found that BERT-PLI's domain-specific paragraph-level modeling did not improve performance over paragraph-level modeling with the original BERT, and that BERT-PLI did not outperform the BM25 baseline in patent retrieval. However, they observed promising results for cross-domain transfer at the document level.

Domain-specific search has always been a challenging information retrieval task because of the domain-specific language, the unique task setting, and the lack of accessible queries and corresponding relevance judgements. In recent years, pretrained language models such as BERT have revolutionized web and news search. Naturally, the community aims to adapt these advancements to cross-domain transfer of retrieval models for domain-specific search. In the context of legal document retrieval, Shao et al. propose the BERT-PLI framework, which models Paragraph-Level Interactions with the language model BERT. In this paper we reproduce the original experiments, clarify pre-processing steps, add missing scripts for framework steps, and investigate different evaluation approaches; however, we are not able to reproduce the evaluation results. Contrary to the original paper, we demonstrate that domain-specific paragraph-level modeling does not appear to help the performance of the BERT-PLI model compared to paragraph-level modeling with the original BERT. In addition to our legal search reproducibility study, we investigate BERT-PLI for document retrieval in the patent domain. We find that the BERT-PLI model does not yet achieve performance improvements for patent document retrieval compared to the BM25 baseline. Furthermore, we evaluate the individual components of the BERT-PLI model for cross-domain retrieval between the legal and patent domains, at both the paragraph level and the document level. We find that transferring the BERT-PLI model at the paragraph level leads to comparable results in both domains, and we obtain first promising results for cross-domain transfer at the document level. For reproducibility and transparency, and to benefit the community, we make our source code and the trained models publicly available.

Code Implementations: 1 repo