ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis
ALBERTI addresses the problem of enabling efficient comparative poetry studies across languages. The contribution is incremental, as it builds on the existing multilingual BERT through domain-specific adaptation.
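The paper tackles the scarcity of multilingual tools for poetry analysis by introducing ALBERTI, a domain-specific pre-trained language model obtained by further training multilingual BERT on over 12 million verses from 12 languages. It outperforms multilingual BERT and other transformer-based models of similar size on Spanish stanza type classification and on metrical pattern prediction for Spanish, English and German, and it achieves state-of-the-art results for German when compared to rule-based systems.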
The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In a multilingual setting, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time-consuming. In this work, we present \textsc{Alberti}, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification, and metrical pattern prediction for Spanish, English and German. In both cases, \textsc{Alberti} outperforms multilingual BERT and other transformer-based models of similar size, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.
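
The abstract does not spell out the training objective behind DSP. As an illustration only, the sketch below assumes the standard masked language modeling objective that BERT is trained with: a fraction of the tokens in each verse is selected for prediction, and of those roughly 80% are replaced by a mask symbol, 10% by a random token and 10% left unchanged. The function name `mask_tokens`, the whitespace tokenization and the toy vocabulary are hypothetical simplifications (multilingual BERT uses WordPiece subwords), not the authors' implementation.

```python
import random

MASK = "[MASK]"  # mask symbol used by BERT-style models


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] is the original token at
    positions the model must predict, and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # This position is selected for prediction.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep original
        else:
            # Not selected: the token passes through and is not scored.
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


# Toy verse, split on whitespace purely for illustration.
verse = "shall i compare thee to a summer's day".split()
toy_vocab = sorted(set(verse))
corrupted, labels = mask_tokens(verse, toy_vocab, rng=random.Random(0))
```

In an actual DSP run, corrupted verses of this kind would be fed to the pre-trained model and the loss computed only at positions where `labels` is not `None`, so that the existing multilingual weights are adapted to verse rather than trained from scratch.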