CLOct 16, 2024

KCIF: Knowledge-Conditioned Instruction Following

arXiv:2410.12972v32 citationsh-index: 6
AI Analysis

This reveals a critical limitation in LLM evaluation for AI researchers, showing that knowledge and instruction following cannot be treated separately.

The study investigated how large language models (LLMs) struggle with simple instructions that modify answers on knowledge benchmarks, finding performance drops of 40-80% across model sizes.

LLM evaluation benchmarks have traditionally separated the testing of knowledge/reasoning capabilities from instruction following. In this work, we study the interaction between knowledge and instruction following, and observe that LLMs struggle to follow simple answer modifying instructions, and are also distracted by instructions that should have no bearing on the original knowledge task answer. We leverage existing multiple-choice answer based knowledge benchmarks and apply a set of simple instructions which include manipulating text (eg.: change case), numeric quantities (eg.: increase value, change formatting), operate on lists (eg.: sort answer candidates) and distractor instructions (eg.: change case of numeric answers). We evaluate models at varying parameter sizes (1B-405B) from different model families and find that, surprisingly, all models report a significant drop in performance on such simple task compositions. While large-sized and frontier models report performance drops of 40-50%, in small and medium sized models the drop is severe (sometimes exceeding 80%). Our results highlight a limitation in the traditional separation of knowledge/reasoning and instruction following, and suggest that joint-study of these capabilities are important. We release our benchmark dataset, evaluation framework code, and results for future work.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes