CL · Feb 13, 2019

Leveraging Newswire Treebanks for Parsing Conversational Data with Argument Scrambling

Riyaz Ahmad Bhat, Irshad Ahmad Bhat, Dipti Misra Sharma

arXiv:1902.05085v1 · 31.01087 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses parsing challenges that morphologically-rich languages pose for conversational applications, but it is incremental, since it adapts existing methods to a new data type.

The paper tackles parsing conversational Hindi data with argument scrambling. It shows that a parser trained on a newswire treebank degrades on such data, and that also training on transformed structures, generated along the lines of Transformational Generative Grammar, improves performance by around 9% LAS.

We investigate the problem of parsing conversational data of morphologically-rich languages such as Hindi where argument scrambling occurs frequently. We evaluate a state-of-the-art non-linear transition-based parsing system on a new dataset containing 506 dependency trees for sentences from Bollywood (Hindi) movie scripts and Twitter posts of Hindi monolingual speakers. We show that a dependency parser trained on a newswire treebank is strongly biased towards the canonical structures and degrades when applied to conversational data. Inspired by Transformational Generative Grammar, we mitigate the sampling bias by generating all theoretically possible alternative word orders of a clause from the existing (kernel) structures in the treebank. Training our parser on canonical and transformed structures improves performance on conversational data by around 9% LAS over the baseline newswire parser.
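
The word-order generation step described in the abstract can be illustrated with a small sketch. This is not the authors' procedure: it assumes a single projective, verb-final clause and only permutes the pre-verbal argument subtrees of the root verb, keeping each subtree internally intact. The `Token` class, the `subtree` and `scramble` helpers, the transliterated example sentence, and its Universal Dependencies style labels are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Token:
    idx: int      # 1-based position in the sentence
    form: str
    head: int     # idx of the head token, 0 for the root
    deprel: str   # illustrative label, not the treebank's own scheme


def subtree(tokens, root_idx):
    """Return the tokens dominated by root_idx (inclusive), in surface order."""
    keep = {root_idx}
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.head in keep and t.idx not in keep:
                keep.add(t.idx)
                changed = True
    return [t for t in tokens if t.idx in keep]


def scramble(tokens):
    """Yield every ordering of the root verb's pre-verbal argument subtrees.

    Assumption: the clause is projective and verb-final, so the verb (and
    anything outside the pre-verbal argument subtrees) stays at the end.
    """
    root = next(t for t in tokens if t.head == 0)
    args = [t for t in tokens if t.head == root.idx and t.idx < root.idx]
    chunks = [subtree(tokens, a.idx) for a in args]
    moved = {t.idx for chunk in chunks for t in chunk}
    rest = [t for t in tokens if t.idx not in moved]
    for order in permutations(chunks):
        new_order = [t for chunk in order for t in chunk] + rest
        # Map old positions to new ones so that head pointers stay valid.
        remap = {t.idx: i for i, t in enumerate(new_order, start=1)}
        yield [
            Token(remap[t.idx], t.form, remap.get(t.head, 0), t.deprel)
            for t in new_order
        ]


# Canonical (kernel) order: "raam ne siitaa ko kitaab dii"
sentence = [
    Token(1, "raam", 6, "nsubj"),
    Token(2, "ne", 1, "case"),
    Token(3, "siitaa", 6, "iobj"),
    Token(4, "ko", 3, "case"),
    Token(5, "kitaab", 6, "obj"),
    Token(6, "dii", 0, "root"),
]

variants = list(scramble(sentence))
for v in variants:
    print(" ".join(t.form for t in v))
```

With three pre-verbal arguments the sketch yields six orderings, the first of which is the canonical one. Each variant keeps the dependency relations of the kernel tree and only updates positions and head indices, so the variants could in principle be added to the training data alongside the original trees.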
