CalibreNet: Calibration Networks for Multilingual Sequence Labeling
CalibreNet addresses boundary errors in multilingual sequence labeling, particularly for low-resource languages, and offers an incremental improvement over existing methods.
The paper tackles the problem of boundary errors in sequence labeling tasks like NER and MRC for low-resource languages by proposing CalibreNet, a two-step method that refines initial predictions, achieving state-of-the-art results on zero-shot cross-lingual benchmarks.
The lack of training data in low-resource languages poses major challenges for sequence labeling tasks such as named entity recognition (NER) and machine reading comprehension (MRC). One major obstacle is boundary errors in the predicted answers. To tackle this problem, we propose CalibreNet, which predicts answers in two steps. In the first step, any existing sequence labeling method can be adopted as a base model to generate an initial answer. In the second step, CalibreNet refines the boundary of the initial answer. To compensate for the lack of training data in low-resource languages, we develop a novel unsupervised phrase boundary recovery pre-training task that enhances the multilingual boundary detection capability of CalibreNet. Experiments on two cross-lingual benchmark datasets show that the proposed approach achieves state-of-the-art results on zero-shot cross-lingual NER and MRC tasks.
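The two-step flow described in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in, not the authors' implementation: `two_step_predict` only shows how a base model's span is handed to a calibrator, `toy_base_model` and `toy_calibrator` are hand-written rules used in place of trained networks, and `perturb_span` is one assumed way to create noisy spans for a boundary recovery objective (the abstract does not specify how the pre-training examples are built).

```python
import random
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def perturb_span(span: Span, n_tokens: int, max_shift: int = 2,
                 rng: Optional[random.Random] = None) -> Span:
    """Assumed noising step: shift each boundary by up to `max_shift` tokens.

    A calibrator could then be trained to map the noisy span back to the
    original one. The result is always a non-empty span inside the sentence.
    """
    rng = rng or random.Random()
    start, end = span
    new_start = min(max(0, start + rng.randint(-max_shift, max_shift)), n_tokens - 1)
    new_end = min(max(new_start + 1, end + rng.randint(-max_shift, max_shift)), n_tokens)
    return new_start, new_end


def two_step_predict(tokens: List[str],
                     base_model: Callable[[List[str]], Span],
                     calibrator: Callable[[List[str], Span], Span]) -> Span:
    """Step 1: the base model proposes a span. Step 2: the calibrator refines it."""
    initial = base_model(tokens)
    return calibrator(tokens, initial)


def toy_base_model(tokens: List[str]) -> Span:
    # Hypothetical base model that cuts the entity one token short.
    return (3, 5)


def toy_calibrator(tokens: List[str], span: Span) -> Span:
    # Hypothetical rule standing in for a learned calibrator:
    # extend the right boundary while the next token is capitalized.
    start, end = span
    while end < len(tokens) and tokens[end].istitle():
        end += 1
    return start, end


tokens = "She joined the University of Oxford in 2019".split()
refined = two_step_predict(tokens, toy_base_model, toy_calibrator)
print(" ".join(tokens[refined[0]:refined[1]]))
```

In this toy run the base span covers "University of" and the calibrator extends it to "University of Oxford". Because the calibrator only consumes a span, the base model can be swapped without touching the second step, which matches the abstract's statement that any existing sequence labeling method can serve as the base.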