CL · May 23, 2025

EXECUTE: A Multilingual Benchmark for LLM Token Understanding

Lukas Edman, Helmut Schmid, Alexander Fraser

arXiv:2505.17784v1 · 1 citation · h-index: 2 · ACL

Originality: Synthesis-oriented

AI Analysis

This work addresses how to evaluate LLM token understanding across diverse languages, a problem of interest to researchers. The contribution is incremental, since it builds on an existing benchmark.

The authors extend the CUTE benchmark to multiple languages with diverse scripts, introducing EXECUTE to assess LLM token understanding. The results show that the challenges vary by language: some languages show word-level issues, and some show no issues at all. The authors also test sub-character tasks in Chinese, Japanese, and Korean.

The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs' understanding of character components.
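To make the distinction between character-level and sub-character understanding concrete, the following is a minimal sketch. It is not the benchmark's implementation, and the abstract does not specify the actual task prompts. The function names `spell` and `hangul_components` are hypothetical, and using Unicode NFD normalization to expose Korean jamo is an assumption chosen purely for illustration.

```python
import unicodedata


def spell(word: str) -> list[str]:
    """Character-level view: split a word into its Unicode code points.

    Hypothetical helper; a model that sees only subword tokens has to
    recover this sequence without direct access to the characters.
    """
    return list(word)


def hangul_components(syllable: str) -> list[str]:
    """Sub-character view: decompose precomposed Hangul into jamo.

    Assumption: canonical decomposition (NFD) is used here as a stand-in
    for "character components"; the paper may define components differently.
    """
    return list(unicodedata.normalize("NFD", syllable))


if __name__ == "__main__":
    print(spell("token"))
    # Print code points rather than glyphs so the components are unambiguous.
    print([f"U+{ord(ch):04X}" for ch in hangul_components("한")])
```

A single Hangul syllable such as "한" is one character at the code-point level but decomposes into three jamo, which is the kind of structure a sub-character task would probe. Chinese and Japanese character components have no equivalent canonical decomposition in Unicode, so they would need a separate resource.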
