CLNov 7, 2024

Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?

MIT
arXiv:2411.04530v212 citationsh-index: 31NAACL
Originality Synthesis-oriented
AI Analysis

This work addresses the problem of understanding cross-lingual transfer in language models for researchers, but it is incremental as it builds on existing methods with new evaluations.

The study investigated whether multilingual language models understand text based on subword-level semantic concepts by merging semantically similar subwords and evaluating on five multilingual tasks, finding that shared semantics improved predictions and zero-shot results matched or exceeded original models on some classification tasks.

Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form "semantic tokens" by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transfer.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes