SentAlign: Accurate and Scalable Sentence Alignment
SentAlign provides a scalable solution for sentence alignment in natural language processing, though the contribution is incremental because it builds on existing bilingual sentence representations.
The authors address the problem of aligning sentences in very large parallel documents with SentAlign, which outperforms five other sentence alignment tools on German-French and English-Icelandic evaluation sets and on a downstream machine translation task.
We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.
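To make the idea of scoring alignment paths concrete, the following is a minimal sketch of dynamic-programming alignment over a precomputed sentence similarity matrix. It is not SentAlign's actual algorithm or API. The function name `align`, the `min_sim` threshold, and the restriction to one-to-one alignments plus skipped sentences are illustrative assumptions. The similarity values are toy numbers; in practice they might be, for example, cosine similarities between LaBSE sentence embeddings.

```python
from typing import List, Optional, Tuple


def align(sim: List[List[float]], min_sim: float = 0.3) -> List[Tuple[int, int]]:
    """Return monotonic 1-1 sentence pairs that maximise the total score.

    sim[i][j] is the similarity of source sentence i and target sentence j.
    A 1-1 alignment contributes sim[i][j] - min_sim; skipping a sentence on
    either side contributes 0. Both choices are hypothetical simplifications.
    """
    n = len(sim)
    m = len(sim[0]) if n else 0

    # best[i][j]: best score for the first i source and first j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:  # align source i-1 with target j-1
                score = best[i - 1][j - 1] + sim[i - 1][j - 1] - min_sim
                candidates.append((score, (i - 1, j - 1)))
            if i > 0:  # leave source sentence i-1 unaligned
                candidates.append((best[i - 1][j], (i - 1, j)))
            if j > 0:  # leave target sentence j-1 unaligned
                candidates.append((best[i][j - 1], (i, j - 1)))
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the end, keeping only the diagonal (1-1) moves.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Toy matrix: three source sentences, four target sentences;
    # target sentence 1 has no counterpart in the source.
    toy_sim = [
        [0.90, 0.10, 0.20, 0.10],
        [0.10, 0.20, 0.80, 0.10],
        [0.20, 0.10, 0.10, 0.95],
    ]
    print(align(toy_sim))  # [(0, 0), (1, 2), (2, 3)]
```

The table has one cell per (source, target) position, so time and memory grow with the product of the two document lengths. That cost is one reason why exhaustive path evaluation becomes impractical for very long documents and why splitting them into smaller pieces first is attractive.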