CL, Feb 1, 2023

Are UD Treebanks Getting More Consistent? A Report Card for English UD

arXiv:2302.00636v128.9289 citations, h-index: 42

Originality: Synthesis-oriented

AI Analysis

This work addresses data consistency issues that affect NLP researchers using UD treebanks. Its contribution is incremental: it reports on consolidation progress rather than introducing new methods.

The study examined whether Universal Dependencies English treebanks are becoming more consistent over time, finding that while consolidation has progressed, joint models still face inconsistencies that limit their ability to use larger training datasets effectively.

Recent efforts to consolidate guidelines and treebanks in the Universal Dependencies project raise the expectation that joint training and dataset comparison are increasingly possible for high-resource languages such as English, which have multiple corpora. Focusing on the two largest UD English treebanks, we examine progress in data consolidation and answer several questions: Are UD English treebanks becoming more internally consistent? Are they becoming more like each other, and to what extent? Is joint training a good idea, and if so, since which UD version? Our results indicate that while consolidation has made progress, joint models may still suffer from inconsistencies, which hamper their ability to leverage a larger pool of training data.
