CL, Jul 10, 2023

Enhancing Cross-lingual Transfer via Phonemic Transcription Integration

Hoang H. Nguyen, Chenwei Zhang, Tao Zhang, Eugene Rohrbaugh, Philip S. Yu

Amazon

arXiv:2307.04361


AI Analysis

This work addresses a specific bottleneck in cross-lingual transfer for languages with diverse scripts, offering a novel method to enhance performance in understudied language groups.

Cross-lingual transfer has so far relied on orthographic representations, which biases it against languages written in different scripts. The paper addresses this with PhoneXL, a framework that integrates phonemic transcriptions as an additional modality. The approach yields consistent improvements on cross-lingual token-level tasks, namely Named Entity Recognition and Part-of-Speech Tagging, for CJKV languages.

Previous cross-lingual transfer methods are restricted to orthographic representation learning via textual scripts. This limitation hampers cross-lingual transfer and is biased towards languages sharing similar well-known scripts. To alleviate the gap between languages from different writing scripts, we propose PhoneXL, a framework incorporating phonemic transcriptions as an additional linguistic modality beyond the traditional orthographic transcriptions for cross-lingual transfer. Particularly, we propose unsupervised alignment objectives to capture (1) local one-to-one alignment between the two different modalities, (2) alignment via multi-modality contexts to leverage information from additional modalities, and (3) alignment via multilingual contexts where additional bilingual dictionaries are incorporated. We also release the first phonemic-orthographic alignment dataset on two token-level tasks (Named Entity Recognition and Part-of-Speech Tagging) among the understudied but interconnected Chinese-Japanese-Korean-Vietnamese (CJKV) languages. Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer and bridge the gap among CJKV languages, leading to consistent improvements on cross-lingual token-level tasks over orthographic-based multilingual PLMs.
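To make the first objective in the abstract (local one-to-one alignment between the orthographic and phonemic modalities) more concrete, here is a minimal sketch. It is not the PhoneXL objective itself, which the abstract does not specify. It swaps in a generic InfoNCE-style contrastive loss and assumes each token already has one orthographic and one phonemic embedding vector. The function names `cosine` and `local_alignment_loss` and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def local_alignment_loss(ortho, phon, temperature=0.1):
    """Illustrative InfoNCE-style loss (an assumption, not the paper's formula).

    ortho[i] and phon[i] are the orthographic and phonemic embeddings of
    token i. The loss is low when each orthographic vector is more similar
    to its own phonemic vector than to those of the other tokens.
    """
    if len(ortho) != len(phon):
        raise ValueError("expected one phonemic vector per orthographic vector")
    total = 0.0
    for i, o in enumerate(ortho):
        logits = [cosine(o, p) / temperature for p in phon]
        # Numerically stable log-sum-exp over all candidate phonemic vectors.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching phonemic vector.
        total += log_denom - logits[i]
    return total / len(ortho)


# Toy example with two tokens and 2-dimensional embeddings.
ortho_vecs = [[1.0, 0.0], [0.0, 1.0]]
matched = local_alignment_loss(ortho_vecs, [[1.0, 0.0], [0.0, 1.0]])
swapped = local_alignment_loss(ortho_vecs, [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

In the toy example the correctly paired embeddings give a much smaller loss than the swapped ones, which is the behavior an alignment objective of this kind is meant to reward.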
