CL, Apr 27, 2018

Extracting Parallel Paragraphs from Common Crawl

arXiv:1804.10413v1, 2 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more parallel text data in machine translation and NLP by enabling extraction from unstructured web sources, though it is incremental as it builds on existing techniques like word2vec.

The authors tackle the problem of extracting parallel paragraphs from unstructured web data. They propose a method that combines bivec and locality-sensitive hashing to identify parallel segments efficiently across diverse sources, validate it by realigning segments from a large parallel corpus, and report that it scales to hundreds of terabytes of Common Crawl data.

Most current methods for mining parallel texts from the web assume that the pages of a web site share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment, with real-world data provided by the Common Crawl Foundation, confirms that our solution scales to web-crawled data sets hundreds of terabytes in size.
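
The abstract does not spell out the pipeline, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that words of both languages live in one shared vector space (as bivec is meant to provide), that a segment is represented by the average of its word vectors, and that random-hyperplane signatures serve as the locality-sensitive hash. The toy vectors, the function names, and the choice of hash family are all illustrative assumptions.

```python
import random
from collections import defaultdict

# Toy shared bilingual embedding space. The values are hypothetical; a real
# system would load vectors trained with bivec on a parallel corpus.
EMBEDDINGS = {
    "en": {
        "cat": [1.0, 0.0, 0.0],
        "sleeps": [0.0, 1.0, 0.0],
        "rain": [-1.0, 0.0, 0.0],
        "falls": [0.0, -1.0, 0.0],
    },
    "cs": {
        "kočka": [1.0, 0.0, 0.0],
        "spí": [0.0, 1.0, 0.0],
    },
}


def segment_vector(lang, text):
    """Average the vectors of the known words; None if no word is known."""
    table = EMBEDDINGS[lang]
    vectors = [table[w] for w in text.lower().split() if w in table]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (assumed hash family: sign projections)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0.0)
        for h in hyperplanes
    )


def candidate_pairs(source_segments, target_segments, src_lang, tgt_lang, hyperplanes):
    """Pair up segments whose signatures fall into the same bucket."""
    buckets = defaultdict(list)
    for seg in target_segments:
        vec = segment_vector(tgt_lang, seg)
        if vec is not None:
            buckets[signature(vec, hyperplanes)].append(seg)
    pairs = []
    for seg in source_segments:
        vec = segment_vector(src_lang, seg)
        if vec is not None:
            for match in buckets.get(signature(vec, hyperplanes), []):
                pairs.append((seg, match))
    return pairs


planes = make_hyperplanes(n_bits=8, dim=3)
pairs = candidate_pairs(["cat sleeps", "rain falls"], ["kočka spí"], "en", "cs", planes)
```

Because only segments that share a bucket are compared, the lookup avoids scoring every source segment against every target segment, which is what makes hashing attractive at web scale. In practice the candidates would still need a separate scoring step to filter out chance collisions.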

Code Implementations: 1 repo
