CL Oct 18, 2018

Large-scale Hierarchical Alignment for Data-driven Text Rewriting

Nikola I. Nikolov, Richard H. R. Hahnloser

arXiv:1810.08237v231.11004 citations Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for scalable data-driven text rewriting, such as simplification, but it is an incremental advance because it builds on existing embedding techniques.

The paper tackles the problem of extracting pseudo-parallel monolingual sentence pairs from comparable corpora without a seed parallel corpus, using hierarchical search over pre-trained embeddings of documents and sentences. It shows that sentence pairs extracted this way achieve competitive performance on text simplification from the normal to the Simple Wikipedia.

We propose a simple unsupervised method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different text styles, such as news articles and scientific papers. Our approach does not require a seed parallel corpus, but instead relies solely on hierarchical search over pre-trained embeddings of documents and sentences. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on text simplification from the normal to the Simple Wikipedia. We show that pseudo-parallel sentences extracted with our method not only supplement existing parallel data, but can even lead to competitive performance on their own.
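The hierarchical search described in the abstract can be illustrated with a minimal sketch: match each source document to its most similar target document, then match sentences only within that document pair. This is not the authors' implementation. The function names, the toy 3-dimensional vectors, the single-nearest-neighbour rule, and both thresholds (`doc_threshold`, `sent_threshold`) are assumptions made for illustration; a real pipeline would presumably use pre-trained document and sentence embeddings and an approximate nearest-neighbour index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(query, candidates, threshold):
    """Return (index, score) of the most similar candidate, or None if
    there are no candidates or the best score is below the threshold."""
    scored = ((i, cosine(query, c)) for i, c in enumerate(candidates))
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best


def hierarchical_align(src_docs, tgt_docs, doc_threshold=0.5, sent_threshold=0.8):
    """Two-level alignment: documents first, then sentences inside each
    matched document pair. Threshold values are hypothetical."""
    pairs = []
    tgt_doc_vecs = [doc["doc_vec"] for doc in tgt_docs]
    for src in src_docs:
        # Level 1: find the nearest target document.
        doc_hit = best_match(src["doc_vec"], tgt_doc_vecs, doc_threshold)
        if doc_hit is None:
            continue
        tgt = tgt_docs[doc_hit[0]]
        tgt_sent_vecs = [vec for _, vec in tgt["sentences"]]
        # Level 2: align sentences only within the matched document.
        for text, vec in src["sentences"]:
            sent_hit = best_match(vec, tgt_sent_vecs, sent_threshold)
            if sent_hit is not None:
                tgt_text = tgt["sentences"][sent_hit[0]][0]
                pairs.append((text, tgt_text, sent_hit[1]))
    return pairs


# Toy data: the vectors stand in for pre-trained embeddings.
src_docs = [
    {
        "doc_vec": [1.0, 0.0, 0.0],
        "sentences": [
            ("The feline reposed upon the rug.", [0.9, 0.1, 0.0]),
            ("An unrelated remark.", [0.0, 0.0, 1.0]),
        ],
    }
]
tgt_docs = [
    {
        "doc_vec": [0.9, 0.1, 0.0],
        "sentences": [("The cat sat on the mat.", [1.0, 0.0, 0.0])],
    },
    {
        "doc_vec": [0.0, 1.0, 0.0],
        "sentences": [("Stocks fell on Monday.", [0.0, 1.0, 0.0])],
    },
]

pairs = hierarchical_align(src_docs, tgt_docs)
for src_text, tgt_text, score in pairs:
    print(f"{score:.3f}\t{src_text}\t{tgt_text}")
```

Restricting the sentence-level search to matched documents keeps the number of comparisons small, which is what makes this kind of alignment practical on large corpora. The sentence threshold then discards source sentences that have no close counterpart.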
