Simplifying the Bible and Wikipedia Using Statistical Machine Translation
This work addresses text simplification for accessibility, but it is incremental as it applies existing SMT methods to new datasets without major methodological innovations.
The study tackled text simplification by applying statistical machine translation (SMT) techniques to parallel corpora of the Bible and Wikipedia, reporting results in terms of METEOR and BLEU scores. It also explored generating text in the King James style as part of a broader goal for linguistic style imitation.
I started this work hoping to build a text synthesizer (like a musical synthesizer) that can imitate particular linguistic styles. Most of the report focuses on text simplification using statistical machine translation (SMT) techniques. I applied MOSES to two parallel corpora: the Bible (King James Version and Easy-to-Read Version) and Wikipedia articles (normal and simplified). I report the importance of the three main components of SMT (phrase translation, language model, and reordering) by changing their weights and comparing the quality of the resulting simplified text in terms of METEOR and BLEU. The end of the report presents some examples of text "synthesized" into the King James style.
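As a rough illustration of why the component weights matter, the sketch below scores two candidate outputs with a log-linear combination of the three component scores, which is the general form of model used in phrase-based SMT. It is not the MOSES API: the function names, candidate labels, probabilities, and weight values are all hypothetical and chosen only to show how shifting weight between components can change which candidate wins.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of log component scores (log-linear model form)."""
    return sum(weights[name] * math.log(p) for name, p in features.items())

def best(candidates, weights):
    """Return the label of the highest-scoring candidate."""
    return max(candidates, key=lambda c: loglinear_score(candidates[c], weights))

# Hypothetical candidates with made-up component probabilities.
candidates = {
    "faithful": {"translation": 0.5, "language_model": 0.1, "reordering": 0.5},
    "fluent": {"translation": 0.2, "language_model": 0.4, "reordering": 0.5},
}

# Two hypothetical weight settings: one favours the phrase translation
# component, the other favours the language model.
tm_heavy = {"translation": 1.0, "language_model": 0.1, "reordering": 0.3}
lm_heavy = {"translation": 0.1, "language_model": 1.0, "reordering": 0.3}

print(best(candidates, tm_heavy))
print(best(candidates, lm_heavy))
```

With these made-up numbers, the translation-heavy setting prefers the candidate with the stronger phrase translation score, while the language-model-heavy setting prefers the candidate with the stronger language model score.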