CL | Aug 17, 2021

Not All Linearizations Are Equally Data-Hungry in Sequence Labeling Parsing

Alberto Muñoz-Ortiz, Michalina Strzyz, David Vilares

arXiv:2108.07556v1 | 30.7654 citations

Originality: Incremental advance

AI Analysis

This work examines the data efficiency of dependency parsing cast as sequence labeling, giving NLP researchers guidance on which linearization to choose in low-resource scenarios. The contribution is incremental, since it compares existing linearizations.

The study investigated how different linearizations for dependency parsing as sequence labeling perform in low-resource setups, finding that head selection encodings are more data-efficient in ideal conditions, but bracketing formats become more advantageous in real-world low-resource configurations.

Different linearizations have been proposed to cast dependency parsing as sequence labeling and solve the task as: (i) a head selection problem, (ii) finding a representation of the token arcs as bracket strings, or (iii) associating partial transition sequences of a transition-based parser to words. Yet, there is little understanding about how these linearizations behave in low-resource setups. Here, we first study their data efficiency, simulating data-restricted setups from a diverse set of rich-resource treebanks. Second, we test whether such differences manifest in truly low-resource setups. The results show that head selection encodings are more data-efficient and perform better in an ideal (gold) framework, but that such advantage greatly vanishes in favour of bracketing formats when the running setup resembles a real-world low-resource configuration.
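To make the head selection idea concrete, here is a minimal sketch of one way to turn a dependency tree into one label per token, so that a sequence labeler can predict it. It assumes a simple signed word-offset scheme; the function names, the `ROOT_LABEL` value, and the example sentence are hypothetical, and the encodings evaluated in the paper may differ (for instance, in how offsets are defined). Bracketing and transition-based linearizations are not shown.

```python
from typing import List

# Hypothetical label for the token whose head is the artificial root.
ROOT_LABEL = "root"


def encode_relative_heads(heads: List[int]) -> List[str]:
    """Encode a tree as one label per token.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    Each label is the signed distance from the token to its head.
    """
    labels = []
    for position, head in enumerate(heads, start=1):
        if head == 0:
            labels.append(ROOT_LABEL)
        else:
            labels.append(f"{head - position:+d}")
    return labels


def decode_relative_heads(labels: List[str]) -> List[int]:
    """Invert encode_relative_heads: recover the head index of each token."""
    heads = []
    for position, label in enumerate(labels, start=1):
        if label == ROOT_LABEL:
            heads.append(0)
        else:
            heads.append(position + int(label))
    return heads


# Illustrative sentence: "She reads long books"
# She -> reads, reads -> root, long -> books, books -> reads
heads = [2, 0, 4, 2]
labels = encode_relative_heads(heads)
recovered = decode_relative_heads(labels)
```

Because every token receives exactly one label, parsing reduces to ordinary sequence labeling, and decoding is a single pass over the predicted labels. A real system would also need to handle predicted labels that point outside the sentence or produce cycles, which this sketch leaves out.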
