CL, LG · Jul 5, 2024

Identifying the Source of Generation for Large Language Models

arXiv:2407.12846v1 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a crucial problem for the safe use of LLMs: identifying the source of generated text, which can improve reliability and privacy. The advance is incremental, however, because it builds on existing token representation methods.

The paper tackles the inability of LLMs to identify the source documents of the text they generate, which makes it hard for users to judge factuality and privacy risks. It introduces a token-level source identification method that maps token representations to reference documents, and its results suggest that tracing generated tokens back to their documents is feasible.

Large language models (LLMs) memorize text from many source documents. During pretraining, an LLM is trained to maximize the likelihood of text, but it neither receives the source of that text nor memorizes it. Accordingly, an LLM cannot provide document information for the content it generates, and users obtain no hint of its reliability, which is crucial for judging factuality or privacy infringement. This work introduces token-level source identification in the decoding step, which maps the token representation to the reference document. We propose a bi-gram source identifier, a multi-layer perceptron that takes two successive token representations as input for better generalization. We conduct extensive experiments on the Wikipedia and PG19 datasets with several LLMs, layer locations, and identifier sizes. The overall results indicate that token-level source identifiers are a feasible way to trace the source document, a crucial problem for the safe use of LLMs.
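To make the bi-gram source identifier concrete, the following is a minimal, dependency-free sketch of the idea described in the abstract: a small multi-layer perceptron receives the representations of two successive tokens and outputs a distribution over candidate reference documents. It is not the authors' implementation. The class name `BigramSourceIdentifier`, the concatenation of the two representations, the single ReLU hidden layer, the layer sizes, and the random (untrained) weights are all assumptions made for illustration; in practice the token representations would come from a chosen layer of an LLM and the identifier would be trained on tokens labeled with their source documents.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create a randomly initialised dense layer as (weights, bias)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x):
    """Apply an affine map: one dot product plus bias per output unit."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class BigramSourceIdentifier:
    """Hypothetical MLP mapping two successive token states to a document."""

    def __init__(self, hidden_size, mlp_size, num_docs, seed=0):
        rng = random.Random(seed)
        # Input is the concatenation of two token representations (assumed).
        self.layer1 = make_layer(2 * hidden_size, mlp_size, rng)
        self.layer2 = make_layer(mlp_size, num_docs, rng)

    def predict_proba(self, h_prev, h_curr):
        x = list(h_prev) + list(h_curr)
        hidden = [max(0.0, z) for z in dense(self.layer1, x)]  # ReLU (assumed)
        return softmax(dense(self.layer2, hidden))


def identify_sources(identifier, token_states):
    """Predict one document index per token, starting at the second token."""
    predictions = []
    for h_prev, h_curr in zip(token_states, token_states[1:]):
        probs = identifier.predict_proba(h_prev, h_curr)
        predictions.append(max(range(len(probs)), key=probs.__getitem__))
    return predictions


# Stand-in for LLM hidden states: 5 tokens, each an 8-dimensional vector.
rng = random.Random(1)
states = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]

identifier = BigramSourceIdentifier(hidden_size=8, mlp_size=16, num_docs=4)
predictions = identify_sources(identifier, states)
print(predictions)  # one document index per bi-gram; values depend on the seed
```

Because each prediction uses a pair of adjacent tokens, a sequence of n tokens yields n - 1 predictions. With untrained weights the indices are arbitrary; the sketch only shows the data flow from token representations to a per-token document label.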

