CL, AI, LG | Apr 19, 2024

CORI: CJKV Benchmark with Romanization Integration -- A step towards Cross-lingual Transfer Beyond Textual Scripts

Amazon
arXiv:2404.12618v1 | 2 citations | h-index: 22 | LREC
Originality: Incremental advance
AI Analysis

This work addresses cross-lingual transfer for CJKV languages. The advance is incremental: it builds on existing methods by incorporating language contact and Romanization.

The paper tackles cross-lingual transfer by showing that selecting source languages with high contact to the target language improves performance. It also introduces a benchmark for CJKV languages and integrates Romanization, yielding enhanced representations and effective zero-shot transfer.

Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.
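The abstract does not spell out the Contrastive Learning objective, so the following is only a minimal sketch of a generic InfoNCE-style loss, not the authors' formulation. It assumes each sentence has one embedding from its native script and one from its Romanized transcription, and that other sentences in the batch act as negatives. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(script_vecs, roman_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch of (script, Romanized) embedding pairs.

    Assumption: row i of script_vecs and row i of roman_vecs encode the same
    sentence; every other row in the batch serves as a negative.
    """
    losses = []
    for i, anchor in enumerate(script_vecs):
        logits = [cosine(anchor, r) / temperature for r in roman_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching Romanized view.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings, made up for illustration only.
script = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # Romanized views close to their script views
swapped = [[0.1, 0.9], [0.9, 0.1]]   # Romanized views paired with the wrong sentence

loss_aligned = info_nce(script, aligned)
loss_swapped = info_nce(script, swapped)
```

In this sketch the loss is small when each script embedding sits closest to its own Romanized view and large when the pairs are mismatched, which is the behaviour a contrastive objective is meant to reward.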

