Punctuation-aware treebank tree binarization
This work addresses a specific bottleneck in NLP treebank processing, of interest to researchers working on syntactic parsing and tree transformations.
The authors tackle the problem of punctuation being dropped in standard treebank binarization pipelines, which harms head-child identification. They develop a punctuation-aware preprocessing method that improves head prediction accuracy on the Penn Treebank from 73.66% (Collins rules) and 86.66% (MLP baseline) to 91.85% with the same MLP classifier.
This article presents a curated resource and evaluation suite for punctuation-aware treebank binarization. Standard binarization pipelines drop punctuation before head selection, which alters constituent shape and harms head-child identification. We release (1) a reproducible pipeline that preserves punctuation as sibling nodes prior to binarization, (2) derived artifacts and metadata (intermediate @X markers, reversibility signatures, alignment indices), and (3) an accompanying evaluation suite covering head-child prediction, round-trip reversibility, and structural compatibility with derivational resources (CCGbank). On the Penn Treebank, punctuation-aware preprocessing improves head prediction accuracy from 73.66% (Collins rules) and 86.66% (MLP baseline) to 91.85% with the same MLP classifier, and achieves competitive alignment against CCGbank derivations. All code, configuration files, and documentation are released to enable replication and extension to other corpora.
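To make the idea of preserving punctuation as sibling nodes and checking round-trip reversibility concrete, here is a minimal sketch. It is not the released pipeline. It assumes right-branching binarization, intermediate labels written as "@" plus the parent label, and a small illustrative set of punctuation tags. The tree encoding and all function names (`binarize`, `unbinarize`, `strip_punct`, `max_arity`) are hypothetical.

```python
# Illustrative sketch only: assumes right-branching binarization and an
# "@" + parent-label convention for intermediate nodes.
# A tree is either a word (str) or a pair (label, [children]).

PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}  # assumed tag set


def binarize(tree):
    """Right-binarize an n-ary tree, introducing @X intermediate nodes."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    if len(children) <= 2:
        return (label, children)
    inter = "@" + label
    node = (inter, children[-2:])          # innermost pair
    for child in reversed(children[1:-2]):  # wrap remaining middle children
        node = (inter, [child, node])
    return (label, [children[0], node])


def unbinarize(tree):
    """Invert binarize() by splicing the children of @X nodes into the parent."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    flat = []
    for child in children:
        child = unbinarize(child)
        if not isinstance(child, str) and child[0].startswith("@"):
            flat.extend(child[1])
        else:
            flat.append(child)
    return (label, flat)


def strip_punct(tree):
    """Drop punctuation preterminals, as a punctuation-blind pipeline would."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    kept = [strip_punct(c) for c in children
            if isinstance(c, str) or c[0] not in PUNCT_TAGS]
    return (label, kept)


def max_arity(tree):
    """Largest number of children under any node."""
    if isinstance(tree, str):
        return 0
    return max([len(tree[1])] + [max_arity(c) for c in tree[1]])


# (S (ADVP (RB However)) (, ,) (NP (PRP he)) (VP (VBD left)) (. .))
tree = ("S", [
    ("ADVP", [("RB", ["However"])]),
    (",", [","]),
    ("NP", [("PRP", ["he"])]),
    ("VP", [("VBD", ["left"])]),
    (".", ["."]),
])

binarized = binarize(tree)                      # punctuation kept as siblings
round_trip_ok = unbinarize(binarized) == tree   # original tree is recovered
stripped_round_trip = unbinarize(binarize(strip_punct(tree)))
lossy = stripped_round_trip != tree             # punctuation cannot be restored
```

In this sketch the punctuation-preserving path recovers the original tree exactly, while the punctuation-stripping path recovers only the stripped tree, so its constituent shape differs from the original before any head-child decision is made.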