CL Oct 18, 2018

Large-scale Hierarchical Alignment for Data-driven Text Rewriting

arXiv:1810.08237v21004 citations
Originality: Incremental advance
AI Analysis

The paper addresses the need for scalable data-driven text rewriting, such as simplification, but the advance is incremental because it builds on existing embedding techniques.

The paper tackles the problem of extracting pseudo-parallel monolingual sentence pairs from comparable corpora without a seed parallel corpus, using hierarchical search over pre-trained embeddings of documents and sentences. It shows that the extracted pairs yield competitive performance on text simplification from the normal to the Simple Wikipedia.

We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on text simplification from the normal to the Simple Wikipedia. We show that pseudo-parallel sentences extracted with our method not only supplement existing parallel data, but can even lead to competitive performance on their own.

Code Implementations: 1 repo
