LG · AI · CL · Sep 15, 2023

Chain-of-Thought Reasoning is a Policy Improvement Operator

arXiv:2309.08589v2 · 14.918 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This research addresses the need for language models to teach themselves without human demonstrations. It offers a proof of concept that could open new directions in autonomous learning, though the advance is incremental because it builds on existing chain-of-thought and AlphaZero concepts.

The paper tackles language models' inability to teach themselves new skills by introducing SECToR, a method in which a model first solves addition problems using chain-of-thought reasoning and the next version of the model is then trained to solve the same problems directly. The resulting models autonomously learn to add up to the longest-length-digit numbers without ground-truth examples beyond an initial supervised fine-tuning phase on numbers with 6 or fewer digits.

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
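The self-learning loop described in the abstract can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the paper's implementation: `ToyModel`, `add_with_cot`, and `self_learning_loop` are hypothetical names, the "model" is a stand-in that answers directly only up to a fixed digit length, the chain-of-thought step is a single hand-written digit decomposition, and "training" is simulated by raising that digit limit once every sampled problem has a chain-of-thought answer.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model.

    It can add "directly" (without reasoning) only when both operands
    have at most `max_digits` digits.
    """

    def __init__(self, max_digits):
        self.max_digits = max_digits

    def add_directly(self, a, b):
        if max(len(str(a)), len(str(b))) > self.max_digits:
            return None  # beyond the model's current direct ability
        return a + b


def add_with_cot(model, a, b):
    """One reasoning step: split off the last digits, solve the shorter
    problem directly, then recombine. Returns None if still too hard."""
    rest = model.add_directly(a // 10, b // 10)
    if rest is None:
        return None
    # 10 * (a // 10 + b // 10) + (a % 10 + b % 10) equals a + b
    return rest * 10 + (a % 10 + b % 10)


def self_learning_loop(model, rounds, samples=50, seed=0):
    """Repeatedly label harder problems via chain of thought, then
    'train' the next model on those labels. No ground truth is used."""
    rng = random.Random(seed)
    for _ in range(rounds):
        target = model.max_digits + 1
        lo, hi = 10 ** (target - 1), 10 ** target - 1
        problems = [(rng.randint(lo, hi), rng.randint(lo, hi))
                    for _ in range(samples)]
        labels = [add_with_cot(model, a, b) for a, b in problems]
        if any(label is None for label in labels):
            break  # reasoning could not improve on the current model
        # Toy "training": assume the next model absorbs the new skill.
        model = ToyModel(target)
    return model


if __name__ == "__main__":
    start = ToyModel(max_digits=6)
    final = self_learning_loop(start, rounds=5)
    print(final.max_digits)
```

In this toy setting, the chain-of-thought step plays the role of the policy improvement operator: it lets the current model answer problems one digit longer than it can handle directly, and distilling those answers into the next model lets the loop continue. A real system would have to cope with reasoning errors and imperfect training, which this sketch ignores.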

