Cross-Domain Evaluation of POS Taggers: From Wall Street Journal to Fandom Wiki
This work highlights the limitations of POS taggers in out-of-domain settings, which matters to NLP researchers and practitioners working with specialized or informal text. The contribution is incremental, however: it applies existing methods to new data without introducing novel techniques.
The study evaluated the cross-domain performance of two POS taggers, Stanford and Bilty, both trained on Wall Street Journal data, on a dataset drawn from the Elder Scrolls Fandom wiki. Accuracy on unknown tokens dropped substantially across domains: from 90.37% to 78.37% for Stanford and from 87.84% to 80.41% for Bilty.
The Wall Street Journal (WSJ) section of the Penn Treebank has long been the de facto standard for evaluating POS taggers, and accuracies over 97\% have been reported. However, less is known about out-of-domain tagger performance, especially with fine-grained label sets. Using data from Elder Scrolls Fandom, a wiki about the \textit{Elder Scrolls} video game universe, we create a modest dataset for qualitatively evaluating the cross-domain performance of two POS taggers: the Stanford tagger (Toutanova et al. 2003) and Bilty (Plank et al. 2016), both trained on WSJ. Our analyses show that performance on tokens seen during training is almost as good as in-domain performance, but accuracy on unknown tokens decreases across domains from 90.37\% to 78.37\% (Stanford) and from 87.84\% to 80.41\% (Bilty). Both taggers struggle with proper nouns and inconsistent capitalization.
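To illustrate the known/unknown split reported above, here is a minimal sketch of how per-bucket accuracy could be computed. It is not the authors' evaluation code: the function name `known_unknown_accuracy`, the input layout, and the choice to treat a token as "known" only if its exact, case-sensitive form appears in the training vocabulary are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def known_unknown_accuracy(
    train_vocab: Set[str],
    gold: List[Tuple[str, str]],
    predicted_tags: List[str],
) -> Dict[str, Optional[float]]:
    """Tagging accuracy split by whether a token was seen in training.

    Hypothetical helper: `gold` holds (token, gold_tag) pairs and
    `predicted_tags` holds the tagger output, aligned one-to-one.
    Assumption: "known" means an exact, case-sensitive vocabulary match.
    """
    if len(gold) != len(predicted_tags):
        raise ValueError("gold and predicted_tags must be aligned")

    # bucket -> [number correct, number of tokens]
    counts = {"known": [0, 0], "unknown": [0, 0]}
    for (token, gold_tag), pred_tag in zip(gold, predicted_tags):
        bucket = "known" if token in train_vocab else "unknown"
        counts[bucket][1] += 1
        if gold_tag == pred_tag:
            counts[bucket][0] += 1

    # Return None for an empty bucket instead of dividing by zero.
    return {
        bucket: (correct / total if total else None)
        for bucket, (correct, total) in counts.items()
    }


# Toy example with made-up tokens and tags, not data from the paper.
vocab = {"the", "dragon", "slays"}
gold = [("the", "DT"), ("Dovahkiin", "NNP"), ("slays", "VBZ"),
        ("the", "DT"), ("dragon", "NN")]
pred = ["DT", "NN", "VBZ", "DT", "NN"]
result = known_unknown_accuracy(vocab, gold, pred)
```

In this toy input the only unknown token is a proper noun mistagged as a common noun, the kind of error the abstract singles out. A case-insensitive vocabulary lookup would change which tokens count as unknown, so the matching rule should be stated alongside any reported figures.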