CL · Oct 7, 2022

Measuring and Narrowing the Compositionality Gap in Language Models

AI2 · UW
arXiv:2210.03350v3 · 1223 citations · h-index: 114
AI Analysis

This paper addresses a key limitation of language models on tasks that require multi-step reasoning, though the contribution is incremental because it builds on existing prompting techniques.

The paper investigates the compositionality gap in language models, showing that larger models in the GPT-3 family improve at factual recall faster than at compositional reasoning, so the gap does not shrink as model size increases. It introduces a new method, self-ask, whose structured prompting narrows this gap and makes it easy to plug in external tools such as a search engine, which improves accuracy further.
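To make the self-ask idea concrete, here is a minimal Python sketch of the control loop, assuming the model is prompted to emit follow-up questions and a search engine supplies the intermediate answers. The marker strings, the function name `self_ask`, and the `generate` and `search` callables are illustrative assumptions, not the authors' released code; a real prompt would also be preceded by few-shot demonstrations, which are omitted here.

```python
from typing import Callable, Optional

# Marker strings are assumptions chosen for illustration.
FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"


def self_ask(
    question: str,
    generate: Callable[[str, str], str],  # (prompt, stop string) -> continuation
    search: Callable[[str], str],         # sub-question -> short answer
    max_hops: int = 4,
) -> Optional[str]:
    """Alternate between model-written follow-ups and search-provided answers."""
    prompt = f"Question: {question}\nAre follow up questions needed here:"
    for _ in range(max_hops):
        # The model writes until it would start answering its own follow-up.
        text = generate(prompt, INTERMEDIATE)
        prompt += text
        if FINAL in text:
            return text.split(FINAL, 1)[1].strip()
        if FOLLOW_UP not in text:
            break  # the model neither asked a follow-up nor answered
        sub_question = text.rsplit(FOLLOW_UP, 1)[1].strip()
        # The search engine, not the model, answers the follow-up question.
        prompt += f"{INTERMEDIATE} {search(sub_question)}\n"
    return None


# Scripted stand-ins for a language model and a search engine (hypothetical).
_script = iter([
    " Yes.\nFollow up: Who directed Jaws?\n",
    "Follow up: Where was Steven Spielberg born?\n",
    "So the final answer is: Cincinnati\n",
])
_facts = {
    "Who directed Jaws?": "Steven Spielberg",
    "Where was Steven Spielberg born?": "Cincinnati",
}

answer = self_ask(
    "Where was the director of Jaws born?",
    generate=lambda prompt, stop: next(_script),
    search=lambda q: _facts[q],
)
```

Passing `generate` and `search` in as callables keeps the loop independent of any particular model API or search backend, so either can be swapped without touching the control flow.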

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.
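The compositionality gap described above can be computed from per-question correctness flags. The sketch below assumes the ratio is taken over the questions whose sub-problems were all answered correctly, counting those where the composed answer is still wrong; the `Result` type and field names are hypothetical, and the sample data is made up purely to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Result:
    sub_correct: Tuple[bool, ...]  # correctness of each sub-question
    composed_correct: bool         # correctness of the multi-hop question


def compositionality_gap(results: Sequence[Result]) -> float:
    """Fraction of questions with all sub-answers right but the composed answer wrong."""
    # Only questions whose sub-problems were all solved count toward the ratio.
    eligible = [r for r in results if all(r.sub_correct)]
    if not eligible:
        raise ValueError("no question had all sub-questions answered correctly")
    failures = sum(1 for r in eligible if not r.composed_correct)
    return failures / len(eligible)


# Illustrative data only, not results from the paper.
sample = [
    Result((True, True), True),
    Result((True, True), False),   # knows both facts, fails to compose them
    Result((True, True), True),
    Result((True, True), True),
    Result((True, False), False),  # excluded: one sub-question was wrong
]
gap = compositionality_gap(sample)
```

In this sample four questions are eligible and one of them fails at composition, so `gap` is 0.25.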

Code Implementations: 1 repo
