CL, DL | Jun 16, 2013

An open diachronic corpus of historical Spanish: annotation criteria and automatic modernisation of spelling

arXiv:1306.3692v2 | 32 citations
Originality: Synthesis-oriented
AI Analysis

This work provides a valuable resource for linguistic research on historical Spanish, though its contribution is incremental, since it builds on existing corpus-annotation and machine translation methods.

The authors address the analysis of historical Spanish texts by creating an open diachronic corpus of approximately 8 million words, together with a lexicon of more than 10 thousand lemmas. They also apply statistical machine translation to automatic spelling modernisation and report very low character error rates.

The IMPACT-es diachronic corpus of historical Spanish compiles over one hundred books, containing approximately 8 million words, in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an open license (Creative Commons by-nc-sa) in order to permit their intensive exploitation in linguistic research. Approximately 7% of the words in the corpus (a selection aimed at enhancing the coverage of the most frequent word forms) have been annotated with their lemma, part of speech, and modern equivalent. This paper describes the annotation criteria followed and the standards, based on the Text Encoding Initiative recommendations, used to represent the texts in digital form. As an illustration of the possible synergies between diachronic textual resources and linguistic research, we describe the application of statistical machine translation techniques to infer probabilistic context-sensitive rules for the automatic modernisation of spelling. Automatic modernisation with this type of statistical method leads to very low character error rates when the output is compared with the supervised modern version of the text.
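
The abstract reports results as a character error rate but does not define it. The sketch below assumes the common definition: the Levenshtein (edit) distance between the system output and the reference, divided by the length of the reference. The function names and the example word pair are illustrative assumptions and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Hypothetical example: an unmodernised spelling compared with its modern form.
print(character_error_rate("dixo", "dijo"))
```

Under this assumed definition, a modernised output identical to the supervised modern version scores 0.0, and each remaining character-level discrepancy raises the score in proportion to the reference length.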

