CLAug 7, 2025

TASE: Token Awareness and Structured Evaluation for Multilingual Language Models

Chenzhuo Zhao, Xinda Wang, Yue Huang, Junting Lu, Ziqian Liu

arXiv:2508.05468v11 citationsh-index: 1Has Code

Originality Incremental advance

AI Analysis

This work addresses the need for better diagnostic tools to assess and improve low-level language understanding and cross-lingual generalization in LLMs, which is crucial for applications requiring precision and control, though it is incremental as it builds on existing evaluation frameworks.

The authors tackled the problem of evaluating large language models' fine-grained token-level understanding and structural reasoning across languages, introducing the TASE benchmark covering 10 tasks in Chinese, English, and Korean with 35,927 instances, and found that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning.

While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning--capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase .

View on arXiv PDF Code

Similar