CL · AI · LG · May 2, 2023

MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

arXiv:2305.01211v1 · 12 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of accurate sentence segmentation for NLP applications in the legal domain. The contribution is incremental, as it focuses on dataset creation and model adaptation for a specific domain.

The authors tackle sentence boundary detection in multilingual legal texts by curating a dataset of over 130,000 annotated sentences in 6 languages. They train monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, report state-of-the-art performance, and show that the multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set.

Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130'000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
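
To illustrate why generic splitting struggles on legal text, here is a minimal sketch of a naive punctuation-based splitter. It is not one of the models or baselines from the paper; the function name `naive_split` and the example sentence are hypothetical and chosen only to show how legal abbreviations such as "Art." and "para." trigger false boundaries.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' whenever whitespace follows.

    This is a deliberately simplistic rule, used here only to show
    the failure mode; it is not a method from the paper.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Hypothetical example text (not taken from the dataset).
# A human reader sees two sentences here.
text = "The claim fails under Art. 5 para. 2 of the Act. The appeal is dismissed."

fragments = naive_split(text)
for fragment in fragments:
    print(repr(fragment))

# The naive rule yields four fragments instead of two sentences,
# because it also breaks after the abbreviations "Art." and "para.".
print(len(fragments))
```

Under this rule the first sentence is cut into three pieces, which is the kind of error that propagates into downstream tasks and motivates domain-specific annotated data.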

Code Implementations: 1 repo