CL · AI · Apr 8, 2025

Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

Oded Ovadia, Meni Brief, Rachel Lemberg, Eitam Sheetrit

Microsoft

arXiv:2504.05571v1 · 13.07 citations · h-index: 18

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of updating LLMs with niche information for domain-specific applications, though the contribution appears incremental because it builds on existing instruction-tuning methods.

The paper tackles the problem of injecting domain-specific or new knowledge into large language models. Continual pre-training, a common approach to this problem, often suffers from catastrophic forgetting and is inefficient when data is limited. The paper introduces Knowledge-Instruct, an approach that relies on instruction-tuning with synthetic data, and reports superior factual memorization, reduced forgetting, and improved contextual understanding.

While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.
