Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre
This work provides tools for linguistic analysis of classical French literature, particularly for stylometric studies, but is incremental as it applies existing neural and CRF methods to a new domain-specific corpus.
The authors tackled the problem of lemmatisation and POS-tagging for classical French theatre by building an annotated corpus and training models, achieving accuracies beyond the current state of the art on in-domain tests and demonstrating robustness in out-of-domain tests on texts up to 20th-century novels.
This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. up to 20th-century novels.