Memorization Dynamics of Fill-in-the-Middle Pretraining

arXiv:2605.2298144.2

AI Analysis

For researchers developing language models, this work offers insight into the memorization dynamics of FIM pretraining, though its findings are incremental and come from controlled experiments.

The paper investigates how fill-in-the-middle (FIM) pretraining affects verbatim memorization compared to standard left-to-right (LTR) pretraining. It finds that FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations, and that verbatim extraction under FIM training grows approximately linearly with repetitions over the tested range.

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context alone is not sufficient: verbatim recall under FIM training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.
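
To make the two ingredients of the abstract concrete, the following is a minimal sketch of (a) rearranging a token sequence into a prefix-suffix-middle FIM training example and (b) scoring a prefix-based probe at several span lengths. It is an illustration under stated assumptions, not the paper's implementation: the sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the helper names `to_fim_psm`, `matched_prefix_len`, and `is_verbatim` are hypothetical, and the paper's actual sentinel tokens, split strategy, and extraction metric may differ.

```python
import random

# Hypothetical sentinel strings (assumption); real tokenizers define their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def to_fim_psm(tokens, rng):
    """Cut a token list at two random points and reorder it as prefix, suffix, middle.

    A causal model trained on this ordering learns to generate the middle
    after having seen both the prefix and the suffix.
    """
    i, j = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [PRE] + prefix + [SUF] + suffix + [MID] + middle

def matched_prefix_len(generated, reference):
    """Count how many leading tokens of `generated` exactly match `reference`."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n

def is_verbatim(generated, reference, span_len):
    """Treat a probe as a verbatim hit if at least `span_len` leading tokens match."""
    return matched_prefix_len(generated, reference) >= span_len
```

Scoring the same generation at more than one `span_len` shows why a single threshold can mislead: a continuation that matches the reference for three tokens and then diverges counts as a hit at `span_len=3` but as a miss at `span_len=4`.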
