CL · May 14, 2016

Capturing divergence in dependency trees to improve syntactic projection

arXiv:1605.04475v1 · 6 citations
Originality: Incremental advance
AI Analysis

The paper addresses the problem of building NLP tools for resource-poor languages that lack large annotated datasets. The advance is incremental, since it builds on existing projection methods.

The paper tackles the problem of syntactic projection for resource-poor languages by automatically detecting divergent structural patterns between languages using small parallel annotated corpora, resulting in improved performance of projection algorithms without prior knowledge of language pairs.

Obtaining syntactic parses is a crucial part of many NLP pipelines. However, most of the world's languages do not have large syntactically annotated corpora available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora consisting of resource-poor and resource-rich language pairs, taking advantage of a parser for the resource-rich language and word alignment between the languages to project the parses onto the data for the resource-poor language. These projection methods can suffer, however, when the two languages are divergent. In this paper, we investigate the possibility of using small, parallel, annotated corpora to automatically detect divergent structural patterns between two languages. These patterns can then be used to improve structural projection algorithms, allowing for better-performing NLP tools for resource-poor languages, in particular those that may not have the large amounts of annotated data necessary for traditional, fully supervised methods. While this detection process is not exhaustive, we demonstrate that common patterns of divergence can be identified automatically without prior knowledge of a given language pair, and that the patterns can be used to improve the performance of projection algorithms.
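
To make the idea of projection through word alignment concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes a one-to-one alignment, represents a dependency tree as a list of head indices, and uses hypothetical function names (project_heads, count_divergences). The second function illustrates one simple way a small annotated parallel corpus could expose divergence, by counting where projected heads disagree with the gold target heads, keyed by source part-of-speech pairs.

```python
from collections import Counter
from typing import Dict, List, Optional

ROOT = -1  # convention used in this sketch: head index -1 marks the root


def project_heads(src_heads: List[int],
                  alignment: Dict[int, int],
                  tgt_len: int) -> List[Optional[int]]:
    """Copy source dependency edges onto target tokens via word alignment.

    src_heads[i] is the head index of source token i (ROOT for the root).
    alignment maps source token index -> target token index
    (assumed one-to-one for simplicity).
    Target tokens that receive no projected head are left as None.
    """
    tgt_heads: List[Optional[int]] = [None] * tgt_len
    for src_child, src_head in enumerate(src_heads):
        tgt_child = alignment.get(src_child)
        if tgt_child is None:
            continue  # unaligned source token: nothing to project
        if src_head == ROOT:
            tgt_heads[tgt_child] = ROOT
        elif src_head in alignment:
            tgt_heads[tgt_child] = alignment[src_head]
    return tgt_heads


def count_divergences(src_heads: List[int],
                      src_pos: List[str],
                      alignment: Dict[int, int],
                      gold_tgt_heads: List[int]) -> Counter:
    """Count projected edges that disagree with the gold target tree.

    Keys are (source child POS, source head POS) pairs, an illustrative
    choice of pattern, not the representation used in the paper.
    """
    projected = project_heads(src_heads, alignment, len(gold_tgt_heads))
    mismatches: Counter = Counter()
    for src_child, src_head in enumerate(src_heads):
        tgt_child = alignment.get(src_child)
        if tgt_child is None or projected[tgt_child] is None:
            continue  # no projected edge to compare
        if projected[tgt_child] != gold_tgt_heads[tgt_child]:
            head_pos = "ROOT" if src_head == ROOT else src_pos[src_head]
            mismatches[(src_pos[src_child], head_pos)] += 1
    return mismatches


# Toy example: three source tokens whose order changes in the target.
src_heads = [1, ROOT, 1]            # tokens 0 and 2 attach to token 1
src_pos = ["PRON", "VERB", "NOUN"]
alignment = {0: 2, 1: 0, 2: 1}      # source index -> target index
gold_tgt_heads = [ROOT, 0, 1]       # hypothetical gold target tree

projected = project_heads(src_heads, alignment, 3)
divergences = count_divergences(src_heads, src_pos, alignment, gold_tgt_heads)
```

Aggregating such counts over a small annotated corpus would show which source configurations most often project incorrectly, and those are the cases a projection algorithm could then be adjusted to handle.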

