CL · Mar 15, 2021

Multi-view Subword Regularization

arXiv:2103.08490v2744 citations
AI Analysis

This paper addresses a bottleneck in multilingual NLP: it improves subword segmentation to achieve better cross-lingual transfer. The contribution is incremental, since it builds on existing subword regularization methods.

The paper tackles sub-optimal subword segmentation in multilingual pretrained representations, which hinders cross-lingual transfer. It proposes Multi-view Subword Regularization (MVR), which enforces consistency between predictions made from different segmentations of the same input, and reports improvements of up to 2.5 points on the XTREME benchmark over standard segmentation.

Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual representations improves the effectiveness of cross-lingual transfer. Second, to take full advantage of different possible input segmentations, we propose Multi-view Subword Regularization (MVR), a method that enforces the consistency between predictions of using inputs tokenized by the standard and probabilistic segmentations. Results on the XTREME multilingual benchmark (Hu et al., 2020) show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
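To make the consistency idea in the abstract concrete, here is a minimal sketch of a two-view training loss. It is an illustration, not the authors' implementation: the abstract only says that MVR enforces consistency between predictions on the standard and the probabilistic segmentation, so the symmetric KL term, the equal weighting of the two cross-entropy terms, the weight `lam`, and all function names below are assumptions made for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def multiview_consistency_loss(logits_standard, logits_sampled, label, lam=1.0):
    """Two-view loss for one example (hypothetical helper, not from the paper).

    logits_standard: model outputs for the deterministically segmented input.
    logits_sampled:  model outputs for a sampled (probabilistic) segmentation.
    label:           index of the gold class.
    lam:             assumed weight on the consistency term.
    """
    log_p = log_softmax(logits_standard)
    log_q = log_softmax(logits_sampled)
    # Average cross-entropy of the gold label under both views.
    cross_entropy = -0.5 * (log_p[label] + log_q[label])
    # Assumed consistency term: symmetric KL between the two predictions.
    consistency = 0.5 * (kl_divergence(log_p, log_q) + kl_divergence(log_q, log_p))
    return cross_entropy + lam * consistency


# Identical predictions incur no consistency penalty; diverging ones do.
agree = multiview_consistency_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], label=0)
disagree = multiview_consistency_loss([2.0, 0.5, -1.0], [0.5, 2.0, -1.0], label=0)
print(agree, disagree)
```

In an actual fine-tuning setup the two logit vectors would come from the same model run on two tokenizations of one sentence, and the loss would be averaged over a batch; the sketch leaves both out to keep the consistency term itself visible.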

Code Implementations: 1 repo
