Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement
This work addresses syntactic generalization in multilingual pretrained language models, a question of interest to NLP researchers. Its contribution is incremental: it applies existing methods to new data.
The paper investigates whether multilingual pretrained language models capture cross-linguistic syntactic representations, focusing on subject-verb agreement in several languages and using curated synthetic data. The results show that these models exhibit language-specific differences and that syntactic structure is not shared across languages, even closely related ones.
In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon -- subject-verb agreement across a variety of sentence structures -- in several languages. Finding a solution to this task requires a system that detects complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps -- detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences -- we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.
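The two-step idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' architecture or the BLM data format: `ToyInstance`, `toy_embed`, `mean_pool`, and `solve` are hypothetical names, the "embedding" is a hand-written English-only feature extractor standing in for a pretrained sentence encoder, and mean pooling stands in for whatever learned component finds the pattern across the sequence. The sketch only shows the shape of the task: process each sentence individually, summarize the input sequence, then pick the best-matching candidate answer.

```python
from dataclasses import dataclass
from typing import Callable, List
import math

Vector = List[float]


@dataclass
class ToyInstance:
    """Hypothetical multiple-choice item: a context sequence plus candidate answers."""
    context: List[str]
    candidates: List[str]
    answer_index: int


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined here as 0.0 when either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def toy_embed(sentence: str) -> Vector:
    """Stand-in for a pretrained sentence embedding (assumption, toy only).

    Expects sentences of the form 'The <noun> <verb>' and encodes grammatical
    number as +1 (plural) or -1 (singular) for the subject and for the verb.
    """
    _, noun, verb = sentence.split()
    subject_number = 1.0 if noun.endswith("s") else -1.0
    verb_number = -1.0 if verb.endswith("s") else 1.0
    return [subject_number, verb_number]


def identity(vector: Vector) -> Vector:
    """Placeholder for the sentence-level step (a learned model in practice)."""
    return vector


def mean_pool(vectors: List[Vector]) -> Vector:
    """Placeholder for the sequence-level step: average the sentence vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def solve(
    instance: ToyInstance,
    embed: Callable[[str], Vector],
    sentence_level: Callable[[Vector], Vector],
    sequence_level: Callable[[List[Vector]], Vector],
) -> int:
    """Return the index of the candidate that best matches the context pattern."""
    # Step 1: represent each context sentence individually.
    context_vectors = [sentence_level(embed(s)) for s in instance.context]
    # Step 2: summarize the pattern across the whole input sequence.
    pattern = sequence_level(context_vectors)
    # Score every candidate against the summarized pattern and take the best.
    scores = [cosine(pattern, sentence_level(embed(c))) for c in instance.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


instance = ToyInstance(
    context=["The cats sleep", "The dogs bark", "The birds sing"],
    candidates=["The cat run", "The horses run", "The horses runs"],
    answer_index=1,
)
predicted = solve(instance, toy_embed, identity, mean_pool)
```

In this toy setup the two incorrect candidates each contain an agreement error, so only the candidate whose subject and verb number match the context scores highest. Swapping `toy_embed` for a real multilingual encoder and the two placeholders for trained components would be the natural next step, under the assumptions stated above.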