CL, Sep 6, 2013

Preparing Korean Data for the Shared Task on Parsing Morphologically Rich Languages

arXiv:1309.1649v2, 19 citations

Originality: Synthesis-oriented

AI Analysis

This work provides standardized data for parsing morphologically rich languages, specifically Korean, supporting the shared task community.

The paper describes the preparation of Korean data for the SPMRL 2013 shared task: 27,363 sentences and 350,090 tokens drawn from the KAIST Treebank. Constituent trees are transformed to Penn Treebank style, dependency trees are derived from the transformed trees using heuristics, and both gold-standard and automatic morphological analyses are provided.

This document gives a brief description of the Korean data prepared for the SPMRL 2013 shared task. A total of 27,363 sentences with 350,090 tokens are used for the shared task. All constituent trees are collected from the KAIST Treebank and transformed to the Penn Treebank style. All dependency trees are converted from the transformed constituent trees using heuristics and labeling rules designed specifically for the KAIST Treebank. In addition to the gold-standard morphological analysis provided by the KAIST Treebank, two sets of automatic morphological analyses are provided for the shared task: one generated by the HanNanum morphological analyzer and the other by the Sejong morphological analyzer.
