CL · Nov 13, 2023

Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure

arXiv:2311.07497v2 · 32 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the need for better tools for analyzing syntactic representations in language models. The advance is incremental: it builds on existing methods for creating nonce data and on existing probing techniques.

The authors study how language models represent and process syntactic structure. They introduce SPUD, a framework for creating multilingual nonce treebanks, and report two findings. First, the perplexity scores of autoregressive language models are significantly more affected by nonce data than those of masked language models. Second, syntactic dependency probes retain a majority of their performance on nonce test data, which suggests that the probes learn syntax independently of semantics.

We introduce SPUD (Semantically Perturbed Universal Dependencies), a framework for creating nonce treebanks for the multilingual Universal Dependencies (UD) corpora. SPUD data satisfies syntactic argument structure, provides syntactic annotations, and ensures grammaticality via language-specific rules. We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks. First, we investigate the effect of nonce data on word co-occurrence statistics, as measured by perplexity scores of autoregressive (ALM) and masked language models (MLM). We find that ALM scores are significantly more affected by nonce data than MLM scores. Second, we show how nonce data affects the performance of syntactic dependency probes. We replicate the findings of Müller-Eberstein et al. (2022) on nonce test data and show that performance declines on both MLMs and ALMs relative to the original test data. However, a majority of the performance is kept, suggesting that the probe indeed learns syntax independently of semantics.
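To make the idea of nonce data concrete, here is a minimal sketch of the general technique: each content word is swapped for a different word that carries the same part-of-speech tag and morphological features, so the syntactic annotation stays valid while the meaning is scrambled. This is not the SPUD implementation. The abstract says SPUD also respects syntactic argument structure and applies language-specific rules, neither of which is modeled here. The names `CONTENT_UPOS`, `build_pool`, and `make_nonce`, the choice of content-word tags, and the toy treebank are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Assumption: these UPOS tags count as content words in this sketch.
CONTENT_UPOS = {"NOUN", "VERB", "ADJ", "ADV"}


def build_pool(sentences):
    """Group content-word forms by (UPOS tag, morphological features)."""
    pool = defaultdict(set)
    for sentence in sentences:
        for form, upos, feats in sentence:
            if upos in CONTENT_UPOS:
                pool[(upos, feats)].add(form)
    # Sort so that sampling with a seeded RNG is reproducible.
    return {key: sorted(forms) for key, forms in pool.items()}


def make_nonce(sentence, pool, rng):
    """Swap each content word for a different form with the same tag and features.

    Function words, and content words with no alternative in the pool,
    are left unchanged. Tags and features are copied through untouched.
    """
    nonce = []
    for form, upos, feats in sentence:
        candidates = [c for c in pool.get((upos, feats), []) if c != form]
        if upos in CONTENT_UPOS and candidates:
            form = rng.choice(candidates)
        nonce.append((form, upos, feats))
    return nonce


# Hypothetical toy treebank: (form, UPOS, features) triples per token.
treebank = [
    [("The", "DET", "Definite=Def"), ("dog", "NOUN", "Number=Sing"),
     ("chased", "VERB", "Tense=Past"), ("the", "DET", "Definite=Def"),
     ("ball", "NOUN", "Number=Sing")],
    [("A", "DET", "Definite=Ind"), ("child", "NOUN", "Number=Sing"),
     ("painted", "VERB", "Tense=Past"), ("the", "DET", "Definite=Def"),
     ("fence", "NOUN", "Number=Sing")],
]

pool = build_pool(treebank)
nonce = make_nonce(treebank[0], pool, random.Random(0))
print(" ".join(form for form, _, _ in nonce))
```

Because the tag and feature sequence is preserved, the original annotations can be reused for the perturbed sentence, which is what makes evaluating a syntactic probe on nonce test data possible.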

Code Implementations: 1 repo
