Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts
This work addresses data quality issues relevant to researchers and engineers in computational creativity. Its contribution is incremental: it applies existing detection methods to a new domain-specific dataset.
The paper tackles the problem of low-quality training data for generative models in creative domains. It compares unsupervised and supervised anomaly detection methods for identifying spelling and grammatical errors in Russian poetry texts, introduces the RUPOR dataset, and releases evaluation code for community use.
The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models, particularly in computational creativity tasks such as poem or song lyric generation. Fluency defects in generated poems significantly reduce their value. However, training texts are often sourced from internet platforms without stringent quality control, which makes it difficult for data engineers to keep defect levels under control. To address this issue, we propose using automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models. In this paper, we present a comprehensive comparison of unsupervised and supervised text anomaly detection approaches on both synthetic and human-labeled datasets. We also introduce the RUPOR dataset, a collection of Russian-language human-labeled poems designed for cross-sentence grammatical error detection, and provide the full evaluation code. Our work aims to give the community tools and insights for improving the quality of training datasets for generative models in creative domains.
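The filtering idea described above can be illustrated with a minimal sketch. This is not the detection method evaluated in the paper: the scorer below is a deliberately simple stand-in (the share of tokens missing from a known-word list), and the names `oov_rate`, `filter_texts`, the toy vocabulary, and the threshold value are all hypothetical choices made for illustration only.

```python
import re
from typing import Callable, Iterable, List, Set

# Matches runs of Unicode letters (Cyrillic included); digits and "_" are excluded.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def oov_rate(text: str, vocabulary: Set[str]) -> float:
    """Toy anomaly score: fraction of tokens not found in the vocabulary.

    Assumption: a real system would use a trained detector instead.
    Texts with no tokens get the worst score (1.0).
    """
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 1.0
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)


def filter_texts(
    texts: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only texts whose anomaly score does not exceed the threshold."""
    return [t for t in texts if score(t) <= threshold]


# Hypothetical toy vocabulary and inputs; the second line contains misspellings.
vocabulary = {"мороз", "и", "солнце", "день", "чудесный"}
poems = [
    "Мороз и солнце; день чудесный!",
    "Марос и сонце; день чудестный!",
]

# The threshold of 0.2 is an arbitrary illustrative value.
kept = filter_texts(poems, lambda t: oov_rate(t, vocabulary), threshold=0.2)
print(kept)
```

Passing the scorer as a callable keeps the filtering step independent of the detector, so an unsupervised or supervised model could be substituted for `oov_rate` without changing `filter_texts`.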