CL · AI · LG · Aug 25, 2025

Understanding Subword Compositionality of Large Language Models

arXiv:2508.17953v1 · 4 citations · h-index: 12 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work provides insights into the internal mechanisms of LLMs; it is an incremental contribution for researchers and practitioners in natural language processing.

The paper investigates how large language models (LLMs) compose subword information into word-level representations, analyzing structural similarity, semantic decomposability, and form retention, and finds that LLM families can be classified into three distinct groups based on their compositional patterns.

Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that the five LLM families we study can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) strong performance when probing, layer by layer, their sensitivity to semantic decomposability; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional patterns in how LLMs encode and integrate subword information.
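As a rough illustration of what a layer-wise structural-similarity measurement could look like, the sketch below compares a composed subword vector with a whole-word vector at each layer. It is not the paper's procedure: the additive composition, the cosine metric, the function names, and the random toy vectors are all assumptions made for illustration, since the abstract does not specify the composition function or the similarity measure.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def layerwise_similarity(subword_states, word_states):
    """Compare composed subword vectors with whole-word vectors per layer.

    subword_states: array of shape (layers, n_subwords, dim)
    word_states:    array of shape (layers, dim)
    Composition by summation is an assumption, not the paper's method.
    """
    composed = subword_states.sum(axis=1)
    return [cosine(c, w) for c, w in zip(composed, word_states)]


# Toy data standing in for real hidden states (hypothetical values).
rng = np.random.default_rng(0)
layers, n_subwords, dim = 4, 2, 8
subword_states = rng.normal(size=(layers, n_subwords, dim))

# Toy whole-word states: the subword sum plus noise that grows with depth,
# so layer 0 matches the sum exactly and deeper layers drift away from it.
noise = rng.normal(size=(layers, dim)) * np.arange(layers)[:, None]
word_states = subword_states.sum(axis=1) + noise

sims = layerwise_similarity(subword_states, word_states)
for layer, sim in enumerate(sims):
    print(f"layer {layer}: cosine similarity = {sim:.3f}")
```

With real models, the toy arrays would be replaced by hidden states extracted per layer for a word tokenized as a single unit and for its subword pieces; how those are obtained depends on the model and tokenizer and is left out here.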
