Assessment of Pre-Trained Models Across Languages and Grammars
This work offers NLP researchers insights into how LLMs learn syntax, but it is incremental because it applies existing methods to new data.
The study assesses how multilingual large language models learn syntax by recovering dependency and constituent structures, using 13 UD treebanks for dependency parsing and 10 treebanks for constituent parsing. It finds that sub-word tokenization is needed to represent syntax and that a language's presence in the pretraining data matters more than the amount of task data.
We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
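To make "casting parsing as sequence labeling" concrete, the following is a minimal sketch of one possible encoding for dependency trees, in which each word receives a single label giving the relative offset to its head. The abstract does not specify which encodings are used, so the relative-offset scheme and the function names `encode_heads` and `decode_heads` are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def encode_heads(heads: List[int]) -> List[str]:
    """Turn a dependency tree into one label per word.

    heads[i] is the 1-based index of the head of word i+1, or 0 for the root.
    Each label is the signed distance from the word to its head
    (an assumed, illustrative encoding).
    """
    labels = []
    for position, head in enumerate(heads, start=1):
        if head == 0:
            labels.append("ROOT")
        else:
            labels.append(f"{head - position:+d}")
    return labels


def decode_heads(labels: List[str]) -> List[int]:
    """Invert encode_heads: recover 1-based head indices from the labels."""
    heads = []
    for position, label in enumerate(labels, start=1):
        heads.append(0 if label == "ROOT" else position + int(label))
    return heads


# "The dog barks": The -> dog, dog -> barks, barks is the root.
example_heads = [2, 3, 0]
example_labels = encode_heads(example_heads)
```

With an encoding of this kind, a tagger that predicts one label per word on top of the pre-trained representations is enough to recover a tree, which is what allows the same labeling setup to be compared across models and formalisms.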