CL, Oct 10, 2017

The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages

arXiv:1710.03838v1
106 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses data scarcity for NLP researchers working on low-resource language adaptation, though it is incremental in that it builds on existing treebank synthesis methods.

The authors tackle the shortage of training data for NLP methods that must adapt to unfamiliar languages. They release Galactic Dependencies 1.0, a set of synthetic languages annotated in Universal Dependencies format, and find that adding these synthetic languages to the source pool in single-source transfer parsing significantly improves results for most target languages.

We release Galactic Dependencies 1.0---a large set of synthetic languages not found on Earth, but annotated in Universal Dependencies format. This new resource aims to provide training and development data for NLP methods that aim to adapt to unfamiliar languages. Each synthetic treebank is produced from a real treebank by stochastically permuting the dependents of nouns and/or verbs to match the word order of other real languages. We discuss the usefulness, realism, parsability, perplexity, and diversity of the synthetic languages. As a simple demonstration of the use of Galactic Dependencies, we consider single-source transfer, which attempts to parse a real target language using a parser trained on a "nearby" source language. We find that including synthetic source languages somewhat increases the diversity of the source pool, which significantly improves results for most target languages.
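
The permutation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the released treebanks order dependents to match the word order of other real languages, whereas the sketch below substitutes a uniform random shuffle as a stand-in. The function name `linearize`, the dictionary-based token format, and the choice to let the head move among its dependents are all assumptions made for illustration.

```python
import random

def linearize(tokens, rng, permute_pos=("NOUN", "VERB")):
    """Return token ids in a new surface order (hypothetical helper).

    For every node whose UPOS tag is in permute_pos, the node and its
    direct dependents are shuffled uniformly at random. ASSUMPTION: a
    uniform shuffle stands in for a model of another language's order.
    Each subtree is emitted as one contiguous block.
    """
    children = {}
    root = None
    for tok in tokens:
        if tok["head"] == 0:
            root = tok["id"]
        else:
            children.setdefault(tok["head"], []).append(tok["id"])
    by_id = {tok["id"]: tok for tok in tokens}

    def visit(node):
        # The head and its direct dependents, in original surface order.
        units = sorted(children.get(node, []) + [node])
        if by_id[node]["upos"] in permute_pos:
            rng.shuffle(units)
        order = []
        for unit in units:
            # Emit the head itself, or the whole subtree of a dependent.
            order.extend([unit] if unit == node else visit(unit))
        return order

    return visit(root)

# Toy sentence "the dog chased a cat" with UD-style heads (0 = root).
sent = [
    {"id": 1, "form": "the", "upos": "DET", "head": 2},
    {"id": 2, "form": "dog", "upos": "NOUN", "head": 3},
    {"id": 3, "form": "chased", "upos": "VERB", "head": 0},
    {"id": 4, "form": "a", "upos": "DET", "head": 5},
    {"id": 5, "form": "cat", "upos": "NOUN", "head": 3},
]
forms = {tok["id"]: tok["form"] for tok in sent}
order = linearize(sent, random.Random(0))
print(" ".join(forms[i] for i in order))
```

Passing `permute_pos=("NOUN",)` or `permute_pos=("VERB",)` reorders only one category, mirroring the "nouns and/or verbs" choice in the abstract. The dependency arcs are unchanged; only the surface order differs.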

Code Implementations: 1 repo