CL · Feb 15, 2024

Knowledge of Pretrained Language Models on Surface Information of Tokens

arXiv:2402.09808v2 · 6 citations · h-index: 7
Originality: Synthesis-oriented
AI Analysis

This work examines what pretrained language models internally know about the tokens they process, a question of interest to NLP researchers. It is incremental, building on existing analyses of model embeddings.

The study investigated whether pretrained language models possess knowledge of token surface information (length, substrings, and token constitution), using 12 models trained mainly on English and Japanese corpora. The results show that the models have knowledge of token length and substrings but not of token constitution, and they point to a bottleneck on the decoder side in using this knowledge.

Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. Additionally, we evaluated the ability of models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge regarding token length and substrings but not token constitution. Additionally, the results imply that there is a bottleneck on the decoder side in terms of effectively utilizing acquired knowledge.
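As a rough illustration of what it means to examine token length stored in embeddings, the sketch below fits a linear probe that predicts a token's character length from its embedding vector. This is not the paper's code or necessarily its method: the embeddings are synthetic stand-ins with a length signal planted by construction, and the names `random_tokens`, `make_synthetic_embeddings`, `fit_linear_probe`, and `probe_accuracy` are hypothetical. With a real model, the synthetic table would be replaced by the model's own embedding matrix.

```python
import numpy as np


def random_tokens(n, max_len=8, seed=0):
    """Generate n random lowercase strings of length 1..max_len (toy vocabulary)."""
    rng = np.random.default_rng(seed)
    letters = list("abcdefghijklmnopqrstuvwxyz")
    return [
        "".join(rng.choice(letters, size=int(rng.integers(1, max_len + 1))))
        for _ in range(n)
    ]


def make_synthetic_embeddings(tokens, dim=16, seed=0):
    """ASSUMPTION: stand-in for a model's embedding table.

    Low-variance random vectors, with the token length added to the first
    dimension so that a probe has something to find.
    """
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(len(tokens), dim))
    lengths = np.array([len(t) for t in tokens], dtype=float)
    emb[:, 0] += lengths  # planted length signal
    return emb, lengths


def fit_linear_probe(X, y):
    """Least-squares linear probe with a bias term; returns the weight vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w


def probe_accuracy(X, y, n_train=300):
    """Train on the first n_train rows, then report exact-match accuracy of
    the rounded length predictions on the remaining rows."""
    w = fit_linear_probe(X[:n_train], y[:n_train])
    X_test = np.hstack([X[n_train:], np.ones((X.shape[0] - n_train, 1))])
    pred = np.rint(X_test @ w)
    return float(np.mean(pred == y[n_train:]))


tokens = random_tokens(400, seed=1)
emb, lengths = make_synthetic_embeddings(tokens, seed=2)
print(f"held-out length accuracy: {probe_accuracy(emb, lengths):.2f}")
```

High held-out accuracy here only confirms that the planted signal is linearly recoverable. On real embeddings, the same probe would need a comparison against a control, such as shuffled length labels, before its accuracy could be read as evidence of stored knowledge.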

