CL | Dec 12, 2024

Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain Chinese Word Segmentation

Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Yang Hou

arXiv:2412.09045v111.519 citations | h-index: 4 | Has Code | COLING

Originality: Incremental advance

AI Analysis

This paper addresses cross-domain Chinese word segmentation for NLP applications, but the advance is incremental because it builds on existing speech-text integration methods.

This paper tackles cross-domain Chinese word segmentation by mining word boundaries from speech-text parallel data using forced alignment and a probability-based filtering strategy. It reports effective results on two target domains, ZX and AISHELL2, with about 1,000 newly annotated sentences serving as the AISHELL2 evaluation data.

Inspired by early research on exploring naturally annotated data for Chinese Word Segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to explicitly mine word boundaries from speech-text parallel data. We employ the Montreal Forced Aligner (MFA) toolkit to perform character-level alignment on speech-text data, giving pauses as candidate word boundaries. Based on detailed analysis of collected pauses, we propose an effective probability-based strategy for filtering unreliable word boundaries. To more effectively utilize word boundaries as extra training data, we also propose a robust complete-then-train (CTT) strategy. We conduct cross-domain CWS experiments on two target domains, i.e., ZX and AISHELL2. We have annotated about 1,000 sentences as the evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our proposed approach.
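The pause-mining step described in the abstract can be illustrated with a minimal sketch. It assumes character-level alignments are already available as `(character, start, end)` tuples in seconds; this format, the function names `candidate_boundaries` and `segment`, and the `min_pause` threshold are all hypothetical choices for illustration. The paper's probability-based filtering and its complete-then-train (CTT) strategy are not reproduced here.

```python
from typing import List, Tuple

# (character, start_sec, end_sec). This tuple format is an assumption for the
# sketch; MFA typically writes TextGrid files that would need parsing first.
AlignedChar = Tuple[str, float, float]


def candidate_boundaries(chars: List[AlignedChar], min_pause: float = 0.05) -> List[int]:
    """Return each index i where a pause of at least min_pause seconds
    separates chars[i] from chars[i + 1]. The threshold is hypothetical."""
    boundaries = []
    for i in range(len(chars) - 1):
        gap = chars[i + 1][1] - chars[i][2]  # next start minus current end
        if gap >= min_pause:
            boundaries.append(i)
    return boundaries


def segment(chars: List[AlignedChar], boundaries: List[int]) -> List[str]:
    """Split the character sequence after each candidate boundary index.
    The result is only a partial segmentation: spans without pauses stay unsplit."""
    pieces, start = [], 0
    for i in boundaries:
        pieces.append("".join(c for c, _, _ in chars[start:i + 1]))
        start = i + 1
    pieces.append("".join(c for c, _, _ in chars[start:]))
    return pieces


# Made-up timings for illustration only.
example = [
    ("我", 0.00, 0.20),
    ("们", 0.20, 0.40),
    ("去", 0.50, 0.70),
    ("北", 0.78, 0.95),
    ("京", 0.95, 1.20),
]
bounds = candidate_boundaries(example)
pieces = segment(example, bounds)
```

With these invented timings, pauses follow the second and third characters, so the sentence splits into three pieces. Real pauses are noisy (speakers also pause inside words or not at all between words), which is why the abstract describes a filtering step for unreliable boundaries.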
