CL · AI · LG · Nov 7, 2024

KnowCoder-X: Boosting Multilingual Information Extraction via Code

ByteDance
arXiv:2411.04794v3 · 6 citations · h-index: 50 · Has Code · ACL
Originality: Highly original
AI Analysis

This paper addresses cross-lingual information extraction for multilingual NLP applications and proposes a new method for a known bottleneck: uneven extraction quality across languages.

The paper tackles the imbalance in information extraction (IE) performance across languages in LLMs by proposing KnowCoder-X, a code LLM that standardizes multilingual schemas as Python classes and formulates IE as a code generation task. Without any training on 29 unseen languages, it surpasses ChatGPT by 30.17% and state-of-the-art methods by 20.03%. The authors additionally evaluate it on 64 IE benchmarks in Chinese and English.

Empirical evidence indicates that LLMs exhibit spontaneous cross-lingual alignment. However, although LLMs show promising cross-lingual alignment in Information Extraction (IE), a significant imbalance across languages persists, highlighting an underlying deficiency. To address this, we propose KnowCoder-X, a powerful code LLM with advanced cross-lingual and multilingual capabilities for universal IE. Firstly, it standardizes the representation of multilingual schemas using Python classes, ensuring a consistent ontology across different languages. Then, IE across languages is formulated as a unified code generation task. Secondly, we conduct IE cross-lingual alignment instruction tuning on the translated instance prediction task to enhance the model's cross-lingual transferability. During this phase, we also construct a high-quality and diverse bilingual IE parallel dataset with 257k samples, called ParallelNER, synthesized by our proposed robust three-stage pipeline, with manual annotation to ensure quality. Despite not being trained on 29 unseen languages, KnowCoder-X surpasses ChatGPT by 30.17% and SoTA by 20.03%, thereby demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 64 IE benchmarks in Chinese and English under various settings demonstrate that KnowCoder-X significantly enhances cross-lingual IE transfer by boosting IE alignment. Our code and dataset are available at: https://github.com/ICT-GoKnow/KnowCoder
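To make the "schemas as Python classes, IE as code generation" idea concrete, here is a minimal sketch. It is not the authors' actual schema or prompt format: the class names (Entity, Person, Organization), the helper parse_extractions, and the example outputs are all hypothetical, chosen only to illustrate how one shared class ontology can receive extractions from text in different languages.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical base class for entity types; `name` is the mention text."""
    name: str


class Person(Entity):
    """A human being."""


class Organization(Entity):
    """A company, institution, or other organized group."""


# The same class ontology is used regardless of the input language.
SCHEMA = {cls.__name__: cls for cls in (Person, Organization)}


def parse_extractions(generated_code: str) -> list[Entity]:
    """Convert model-generated code into schema objects.

    The code is parsed with `ast` instead of being executed, so only calls
    to known schema classes with one string literal argument are accepted.
    """
    tree = ast.parse(generated_code)
    entities = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in SCHEMA
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            entities.append(SCHEMA[node.func.id](node.args[0].value))
    return entities


# Hypothetical model outputs for an English and a Chinese sentence.
english_output = 'results = [Person("Marie Curie"), Organization("Sorbonne")]'
chinese_output = 'results = [Person("玛丽·居里"), Organization("索邦大学")]'

for output in (english_output, chinese_output):
    print(parse_extractions(output))
```

Because both outputs instantiate the same classes, downstream code can treat extractions from either language uniformly, which is the kind of consistent ontology the abstract describes.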

Code Implementations: 1 repo
