PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods
This work addresses the evaluation and improvement of text retrieval methods for the Polish language. The contribution is incremental: it builds on existing techniques but applies them to a new linguistic context.
The authors address the lack of a comprehensive evaluation framework for Polish text retrieval by creating PIRB, a benchmark of 41 tasks that includes 10 new datasets. They also introduce a three-step training process; the resulting dense models outperform existing methods, and hybrid methods improve performance further.
We present the Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. To validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.
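To make the idea of a sparse-dense hybrid retriever concrete, the following is a minimal sketch of one common way to merge and rescore candidates from two retrievers. It is not the rescoring model described in the abstract, whose details are not given here. The function names `min_max` and `hybrid_rank`, the weights `w_sparse` and `w_dense`, and the example scores are all hypothetical, and a fixed linear combination stands in for the learned lightweight rescorer.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; a constant score set maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    w_sparse: float = 0.3,  # hypothetical weight, not taken from the paper
    w_dense: float = 0.7,   # hypothetical weight, not taken from the paper
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Merge candidates from both retrievers and rescore them.

    A linear combination of normalized scores stands in for a
    learned rescoring model (an assumption of this sketch).
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    # Union of candidates: a document missing from one list scores 0 there.
    candidates = set(sparse_n) | set(dense_n)
    scored = [
        (doc, w_sparse * sparse_n.get(doc, 0.0) + w_dense * dense_n.get(doc, 0.0))
        for doc in candidates
    ]
    # Sort by descending score, breaking ties by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Made-up scores for a single query: sparse (e.g. BM25) and dense (e.g. cosine).
sparse_scores = {"d1": 12.0, "d2": 7.0, "d3": 2.0}
dense_scores = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
ranking = hybrid_rank(sparse_scores, dense_scores)
print([doc for doc, _ in ranking])
```

In this toy example, `d2` ranks first because both retrievers score it highly, while `d1` and `d4` each appear in only one candidate list. A trained rescorer would replace the fixed weights with parameters learned from relevance labels.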