A Data-Oriented Model of Literary Language
This work addresses the problem of quantifying literary language, for researchers in computational linguistics and digital humanities. It is an incremental advance with a specific domain application.
The paper tackles the task of predicting how literary a text is, using human ratings as a gold standard. Its model combines a bigram baseline, syntactic tree fragments, and hand-picked features, and explains 76.0% of the variation in literary ratings.
We consider the task of predicting how literary a text is, with a gold standard from human ratings. Aside from a standard bigram baseline, we apply rich syntactic tree fragments, mined from the training set, and a series of hand-picked features. Our model is the first to distinguish degrees of highly and less literary novels using a variety of lexical and syntactic features, and explains 76.0% of the variation in literary ratings.
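To make the two quantitative notions in the abstract concrete, the following minimal sketch shows how bigram counts can be extracted from a tokenized text and how "explained variation" is conventionally computed as the coefficient of determination, R². It assumes that the reported 76.0% corresponds to an R² of 0.76. The function names, the toy sentence, and the ratings are hypothetical illustrations, not the authors' code or data.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)  # total variation in ratings
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained part
    return 1 - ss_res / ss_tot


# Hypothetical toy data: one tokenized sentence and four rated novels.
tokens = "the cat sat on the cat".split()
counts = bigram_counts(tokens)

ratings = [2.0, 4.0, 6.0, 8.0]      # assumed human ratings (gold standard)
predictions = [3.0, 4.0, 6.0, 7.0]  # assumed model output
score = r_squared(ratings, predictions)
```

In this toy case the total variation is 20 and the residual variation is 2, so the predictions explain 90% of the variation in the ratings. An R² of 0.76 would be read the same way.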