CL

Sep 13, 2021

Wine is Not v i n. -- On the Compatibility of Tokenizations Across Languages

Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

arXiv:2109.05772v1

17 citations

Originality: Incremental advance

AI Analysis

The paper addresses a specific issue in multilingual NLP: tokenizations that differ in granularity across languages. The advance is incremental because it builds on existing subword tokenization methods rather than replacing them.

The paper tackles incompatible tokenizations across languages in multilingual language models, which make it harder to learn good semantic representations. It proposes a compatibility measure that vocabulary designers can use to build more compatible vocabularies across languages.

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., "wine" (word-level) in English vs. "v i n" (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible, a desideratum that so far has been neglected in multilingual models.
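
The "wine" vs. "v i n" mismatch from the abstract can be illustrated with a small sketch. This is not the paper's compatibility measure, which the abstract does not define. The greedy longest-match tokenizer, the toy vocabularies, and the `granularity_gap` score (mean absolute difference in token counts over translation pairs) are all hypothetical stand-ins chosen only to make the granularity mismatch concrete.

```python
def greedy_tokenize(word, vocab):
    """Longest-match-first segmentation; unknown spans fall back to single characters."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def granularity_gap(pairs, vocab_a, vocab_b):
    """Hypothetical score: mean absolute difference in token counts per translation pair."""
    gaps = [
        abs(len(greedy_tokenize(a, vocab_a)) - len(greedy_tokenize(b, vocab_b)))
        for a, b in pairs
    ]
    return sum(gaps) / len(gaps)


# Toy vocabularies (assumed for illustration only).
en_vocab = {"wine", "water"}
fr_vocab_small = set()          # no subwords learned: character-level tokens
fr_vocab_large = {"vin", "eau"}  # word-level tokens, matching the English side

pairs = [("wine", "vin"), ("water", "eau")]
gap_small = granularity_gap(pairs, en_vocab, fr_vocab_small)
gap_large = granularity_gap(pairs, en_vocab, fr_vocab_large)
print(greedy_tokenize("vin", fr_vocab_small), gap_small, gap_large)
```

Under these assumptions, the character-level French vocabulary produces a larger gap against the word-level English vocabulary than the word-level French one does, which is the kind of mismatch the abstract describes as incompatible.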

