CL LG May 17, 2020

Encodings of Source Syntax: Similarities in NMT Representations Across Target Languages

arXiv:2005.08177v131.0996 citations

Originality: Incremental advance

AI Analysis

This work examines what syntax NMT models learn, a question of interest in computational linguistics. It is an incremental advance because it builds on existing research into encoder representations.

The study investigates whether neural machine translation (NMT) encoders learn similar source syntax across different target languages. It finds that they do, relying on explicit morphosyntactic cues, and that NMT encoders outperform directly trained RNNs on several constituent label prediction tasks. However, the syntactic information learned by both the NMT encoders and the RNNs differs substantially from that captured by a PCFG parser. Despite lower overall accuracy, the PCFG often performs well on sentences where the RNN-based models perform poorly, which suggests that RNN architectures are constrained in the types of syntax they can learn.

We train neural machine translation (NMT) models from English to six target languages, using NMT encoder representations to predict ancestor constituent labels of source language words. We find that NMT encoders learn similar source syntax regardless of NMT target language, relying on explicit morphosyntactic cues to extract syntactic features from source sentences. Furthermore, the NMT encoders outperform RNNs trained directly on several of the constituent label prediction tasks, suggesting that NMT encoder representations can be used effectively for natural language tasks involving syntax. However, both the NMT encoders and the directly-trained RNNs learn substantially different syntactic information from a probabilistic context-free grammar (PCFG) parser. Despite lower overall accuracy scores, the PCFG often performs well on sentences for which the RNN-based models perform poorly, suggesting that RNN architectures are constrained in the types of syntax they can learn.
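To make the prediction target concrete, here is a minimal sketch of how ancestor constituent labels could be read off a parse tree for each source word. It assumes Penn-Treebank-style bracketed parses and counts ancestors starting above the part-of-speech tag; the abstract does not specify either detail, and the function names `parse_brackets` and `ancestor_labels` are hypothetical, not taken from the paper.

```python
def parse_brackets(s):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def ancestor_labels(tree, depth):
    """Return [(word, label)] pairs, where label is the depth-th constituent
    above the word's POS tag (assumed convention), or None if the tree is
    too shallow at that word."""
    out = []

    def walk(node, path):
        label, children = node
        for child in children:
            if isinstance(child, str):
                # `path` holds the labels above the POS tag, root first.
                idx = len(path) - depth
                out.append((child, path[idx] if idx >= 0 else None))
            else:
                walk(child, path + [label])

    walk(tree, [])
    return out


tree = parse_brackets(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
for depth in (1, 2, 3):
    print(depth, ancestor_labels(tree, depth))
```

In a probing setup of the kind the abstract describes, each `(word, label)` pair would supply the label that a classifier tries to predict from the encoder's representation of that word; the classifier itself is outside the scope of this sketch.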
