CL · Feb 15, 2024

Knowledge of Pretrained Language Models on Surface Information of Tokens

arXiv:2402.09808v2 · 4.86 citations · h-index: 7

Originality: Synthesis-oriented

AI Analysis

This work examines what pretrained language models internally know about their own tokens, a question of interest to NLP researchers. Its contribution is incremental, since it builds on existing analyses of model embeddings.

The study investigated whether pretrained language models possess knowledge about token surface information, such as length, substrings, and constitution, using 12 models trained on English and Japanese corpora. Results showed that models have knowledge of token length and substrings but not token constitution, and identified a bottleneck on the decoder side in utilizing this knowledge.

Do pretrained language models have knowledge regarding the surface information of tokens? We examined the surface information stored in word or subword embeddings acquired by pretrained language models from the perspectives of token length, substrings, and token constitution. Additionally, we evaluated the ability of models to generate knowledge regarding token surfaces. We focused on 12 pretrained language models that were mainly trained on English and Japanese corpora. Experimental results demonstrate that pretrained language models have knowledge regarding token length and substrings but not token constitution. Additionally, the results imply that there is a bottleneck on the decoder side in terms of effectively utilizing acquired knowledge.
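To make the probed properties concrete, the following minimal sketch computes two of the surface labels named in the abstract (token length and substring membership) for a toy vocabulary. It is an illustration only, not the authors' code: the function names, the toy vocabulary, and the choice to strip subword boundary markers (`##`, `▁`, `Ġ`) before measuring are all assumptions. Token constitution is left out because the abstract does not define it.

```python
from typing import Dict, List

# Common subword boundary markers (WordPiece "##", SentencePiece U+2581,
# byte-level BPE U+0120). Treating them as non-surface characters is an
# assumption of this sketch, not a detail taken from the paper.
MARKERS = ("##", "\u2581", "\u0120")


def surface_form(token: str) -> str:
    """Strip one leading subword marker so only visible characters remain."""
    for marker in MARKERS:
        if token.startswith(marker):
            return token[len(marker):]
    return token


def length_label(token: str) -> int:
    """Target for a hypothetical length probe: number of surface characters."""
    return len(surface_form(token))


def substring_label(token: str, substring: str) -> bool:
    """Target for a hypothetical substring probe: does the token contain it?"""
    return substring in surface_form(token)


def build_probe_targets(vocab: List[str], substring: str) -> Dict[str, dict]:
    """Map each token to the labels a probe on its embedding would predict."""
    return {
        tok: {
            "length": length_label(tok),
            "has_substring": substring_label(tok, substring),
        }
        for tok in vocab
    }


if __name__ == "__main__":
    toy_vocab = ["##ing", "\u2581token", "cat"]  # hypothetical vocabulary
    for tok, labels in build_probe_targets(toy_vocab, "in").items():
        print(repr(tok), labels)
```

In a probing setup of this kind, labels like these would serve as targets for a small classifier trained on each token's embedding; whether the paper uses this exact procedure is not stated in the abstract.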
