CL · AI · Oct 19, 2020

PySBD: Pragmatic Sentence Boundary Disambiguation

arXiv:2010.09657v1 · 1003 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

PySBD provides a practical, out-of-the-box solution for sentence segmentation across 22 languages. The contribution is incremental, since it ports and improves an existing Ruby implementation.

The authors tackle sentence boundary disambiguation with PySBD, a rule-based Python package that works for 22 languages. It passes 97.92% of the Golden Rule Set exemplars for English, an improvement of 25% over the next best open-source Python tool.

In this paper, we present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can produce logical sentences even when the format and domain of the input text are unknown. In our work, we adapt the Golden Rule Set (a language-specific set of sentence boundary exemplars), originally implemented as a Ruby gem, pragmatic_segmenter, which we ported to Python with additional improvements and functionality. PySBD passes 97.92% of the Golden Rule Set exemplars for English, an improvement of 25% over the next best open-source Python tool.
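To make the idea in the abstract concrete, here is a minimal sketch of rule-based sentence boundary disambiguation and of scoring a segmenter against exemplars. It is a toy, not PySBD's rules or API: the names `ABBREVIATIONS`, `is_boundary`, `segment`, `pass_rate` and `EXEMPLARS` are hypothetical, the abbreviation list is deliberately tiny, and the exemplars are illustrative stand-ins written in the style of a golden rule set.

```python
import re

# Hypothetical, deliberately small abbreviation list (not PySBD's actual rules).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "p", "pp", "vs", "etc", "e.g", "i.e"}


def is_boundary(text, match):
    """Decide whether the punctuation run at `match` ends a sentence."""
    if match.group() != ".":
        # "!", "?" and runs such as "?!" or "..." are always treated as boundaries.
        return True
    words = text[:match.start()].split()
    last = words[-1].lower() if words else ""
    if last in ABBREVIATIONS:
        return False  # period belongs to a known abbreviation, e.g. "p."
    if len(last) == 1 and last.isalpha():
        return False  # period follows a single-letter initial, e.g. "E."
    return True


def segment(text):
    """Split `text` into sentences using the toy rules above."""
    sentences, start = [], 0
    # Candidate boundaries: terminal punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        if is_boundary(text, match):
            sentence = text[start:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def pass_rate(exemplars):
    """Percentage of (text, expected_sentences) exemplars segmented exactly right."""
    passed = sum(segment(text) == expected for text, expected in exemplars)
    return 100.0 * passed / len(exemplars)


# Illustrative exemplars only; the last one is a case the toy rules get wrong.
EXEMPLARS = [
    ("Hello World. My name is Jonas.",
     ["Hello World.", "My name is Jonas."]),
    ("My name is Jonas E. Smith. Please turn to p. 55.",
     ["My name is Jonas E. Smith.", "Please turn to p. 55."]),
    ("What is your name? My name is Jonas.",
     ["What is your name?", "My name is Jonas."]),
    ("He works at Yahoo! in California.",
     ["He works at Yahoo! in California."]),
]

print(segment("My name is Jonas E. Smith. Please turn to p. 55."))
print(pass_rate(EXEMPLARS))
```

The last exemplar fails because the toy rules treat every "!" as a boundary. Cases like this are what a fixed set of exemplars is meant to expose, and an exemplar pass rate such as the 97.92% figure quoted above can be read as the share of such cases a segmenter handles exactly.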

Code Implementations: 1 repo