CL | Dec 19, 2022

Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Stanford
arXiv:2212.09912v2 | 135 citations | h-index: 63
Originality: Incremental advance
AI Analysis

This paper addresses a subtle but impactful technical issue for researchers and practitioners who use generative models on extractive tasks, though it is an incremental improvement.

The paper identifies tokenization inconsistency as a performance bottleneck for generative models on extractive NLP tasks such as question answering, and shows that training with consistent tokenization yields an average gain of +1.7 F1 across 8 QA datasets while reducing hallucination.

Generative models have been widely applied to extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency, which is commonly neglected when training these models. When the tokenizer tokenizes the input and the output inconsistently, the output tokens are no longer a span of the input tokens; this damages the extractive nature of the task and leads to a performance drop as well as hallucination. We propose a simple yet effective fix for this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F1 when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be done when solving extractive tasks, and we recommend applying consistent tokenization during training.
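To make the inconsistency concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: `toy_tokenize` is a hypothetical stand-in for a space-sensitive subword tokenizer (it is not BART's BPE), and the function names and the example sentence are invented for this note. The sketch contrasts tokenizing the answer on its own with taking the answer's tokens from the tokenized context.

```python
import re


def toy_tokenize(text):
    """Hypothetical space-sensitive tokenizer (an assumption, not BART's BPE).

    Returns (token, start, end) triples. A token preceded by a space gets a
    'Ġ' prefix, mimicking how byte-level BPE folds the leading space into
    the token.
    """
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start, end = match.span()
        prefix = "Ġ" if start > 0 and text[start - 1] == " " else ""
        tokens.append((prefix + match.group(), start, end))
    return tokens


def is_contiguous_subsequence(needle, haystack):
    """True if `needle` appears as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def inconsistent_target(answer):
    """Tokenize the answer in isolation, ignoring its position in the context."""
    return [tok for tok, _, _ in toy_tokenize(answer)]


def consistent_target(context, answer_start, answer_end):
    """Take the tokens of the tokenized context that lie inside the answer span."""
    return [
        tok
        for tok, start, end in toy_tokenize(context)
        if start >= answer_start and end <= answer_end
    ]


context = "The treaty was signed in 1912."
answer = "1912"
start = context.index(answer)

context_tokens = [tok for tok, _, _ in toy_tokenize(context)]
naive = inconsistent_target(answer)
fixed = consistent_target(context, start, start + len(answer))

print("context tokens:", context_tokens)
print("answer tokenized alone:", naive, is_contiguous_subsequence(naive, context_tokens))
print("answer taken from context:", fixed, is_contiguous_subsequence(fixed, context_tokens))
```

In this toy setting the answer tokenized alone lacks the leading-space marker, so it is not a span of the context tokens and the model would have to generate a token it never saw in the input. The target taken from the tokenized context is a span by construction. With a real tokenizer, the same idea would typically rely on character offsets returned alongside the tokens.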

Code Implementations: 1 repo