Pairing Orthographically Variant Literary Words to Standard Equivalents Using Neural Edit Distance Models
This work addresses a domain-specific challenge in computational linguistics for literary text processing, but its contribution is incremental: it applies existing methods to a new dataset.
The paper tackles the problem of pairing orthographically variant words from 19th-century U.S. literature with their standard equivalents using neural edit distance models. It compares these models with models trained on L2 English learner errors and analyzes the effect of different negative sample generation strategies.
We present a novel corpus of orthographically variant words found in works of 19th-century U.S. literature, each annotated with its corresponding "standard" form. We train a set of neural edit distance models to pair these variants with their standard forms, and compare their performance to that of neural edit distance models trained on a corpus of orthographic errors made by L2 English learners. Finally, we analyze the relative performance of these models in light of different negative training sample generation strategies, and offer concluding remarks on the unique challenge literary orthographic variation poses to string pairing methodologies.
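As a rough illustration of what a negative sample generation strategy can look like, the sketch below contrasts random negatives with "hard" negatives chosen by string similarity. It uses classical Levenshtein distance as a stand-in for a learned neural edit distance, so it is not the model described here. The function names, the toy vocabulary, and the example pair "gwine"/"going" are assumptions made for illustration and are not taken from the corpus.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Classical dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def random_negatives(gold, vocab, k, rng):
    """Hypothetical strategy 1: sample k wrong candidates uniformly at random."""
    candidates = [w for w in vocab if w != gold]
    return rng.sample(candidates, min(k, len(candidates)))


def hard_negatives(variant, gold, vocab, k):
    """Hypothetical strategy 2: pick the k wrong candidates closest to the variant."""
    candidates = [w for w in vocab if w != gold]
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(candidates, key=lambda w: (levenshtein(variant, w), w))[:k]


# Toy data, invented for this example.
vocab = ["going", "gone", "wine", "river", "nothing"]
variant, gold = "gwine", "going"

print(levenshtein(variant, gold))                          # 2
print(hard_negatives(variant, gold, vocab, k=2))           # ['wine', 'gone']
print(random_negatives(gold, vocab, 2, random.Random(0)))  # two random non-gold words
```

In this toy case the closest wrong candidate ("wine", distance 1) is nearer to the variant than the gold form ("going", distance 2), which hints at why surface similarity alone can mislead a pairing model on literary spellings.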