CL | May 27, 2025

Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing

arXiv:2505.20976v1 | 1 citation | h-index: 10 | ACL
Originality: Incremental advance
AI Analysis

This paper addresses cross-domain constituency parsing, offering computational linguists an incremental improvement in treebank generation.

The paper tackles the shortage of multi-domain constituency treebanks by proposing LLM back generation, which automatically creates cross-domain constituency treebanks, and pairs it with span-level contrastive learning pre-training. It reports state-of-the-art average performance across five target domains of MCTB.

Cross-domain constituency parsing remains an unsolved challenge in computational linguistics because available multi-domain constituency treebanks are limited. In this paper, we investigate automatic treebank generation with large language models (LLMs). Because LLMs perform poorly on constituency parsing itself, we propose a novel treebank generation method, LLM back generation, which resembles the reverse process of constituency parsing. LLM back generation takes as input an incomplete cross-domain constituency tree containing only domain keyword leaf nodes and fills in the missing words to generate a cross-domain constituency treebank. We also introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of the LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art average performance compared with various baselines.
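To make the input to LLM back generation concrete, here is a minimal sketch of an incomplete tree in which only domain keyword leaves survive. It is not the authors' implementation. The bracketed tree format, the `[MASK]` placeholder, the prompt wording, and the names `mask_tree` and `build_prompt` are all assumptions for illustration, as is the choice to derive the skeleton by masking an existing tree.

```python
import re

# Matches a preterminal such as "(NN protein)": a POS tag followed by one word.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def mask_tree(tree: str, keywords: set, placeholder: str = "[MASK]") -> str:
    """Keep only domain keyword leaves; replace every other word with a placeholder.

    Assumption: the tree is a PTB-style bracketed string, and keywords are
    given in lowercase.
    """
    def repl(match):
        tag, word = match.group(1), match.group(2)
        kept = word if word.lower() in keywords else placeholder
        return f"({tag} {kept})"

    return LEAF.sub(repl, tree)


def build_prompt(masked_tree: str, domain: str) -> str:
    """Hypothetical prompt asking an LLM to fill the missing words."""
    return (
        f"Fill every [MASK] in this {domain} constituency tree with one word, "
        f"keeping the tree structure unchanged:\n{masked_tree}"
    )


tree = "(S (NP (DT The) (NN protein)) (VP (VBZ binds) (NP (NN DNA))))"
masked = mask_tree(tree, {"protein", "dna"})
prompt = build_prompt(masked, "biomedical")
```

Here `masked` keeps the syntactic skeleton and the two keyword leaves, and the LLM is asked only to supply words, which is the direction the abstract describes as the reverse of parsing.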
