CL | AI | Oct 14, 2024

Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

arXiv:2410.11020v5 | 7 citations | h-index: 5 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of improving NLU capabilities in LLMs for AI applications, though it is incremental as it applies an existing RL method to a known bottleneck.

The paper addresses the underperformance of large language models (LLMs) on natural language understanding (NLU) tasks by applying reinforcement learning, specifically Proximal Policy Optimization (PPO). PPO yields an average improvement of 6.3 points on GLUE over supervised fine-tuning and outperforms GPT-4o by over 4% on average across sentiment and natural language inference tasks.

Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE. Motivated by the success of reinforcement learning in reasoning tasks (e.g., DeepSeek), we explore Proximal Policy Optimization (PPO) as a framework to improve the NLU capabilities of LLMs. We frame NLU as a reinforcement learning environment, treating token generation as a sequence of actions and optimizing for reward signals based on alignment with ground-truth labels. PPO consistently outperforms supervised fine-tuning, yielding an average improvement of 6.3 points on GLUE, and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively. Notably, PPO-tuned models outperform GPT-4o by over 4% on average across sentiment and natural language inference tasks, including gains of 7.3% on the Mental Health dataset and 10.9% on SIGA-nli. This work highlights a promising direction for adapting LLMs to new tasks by reframing them as reinforcement learning problems, enabling learning through simple end-task rewards rather than extensive data curation.
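
The framing in the abstract (generated tokens as actions, reward from agreement with the ground-truth label) can be illustrated with a minimal sketch. The abstract does not give the paper's exact reward design, so the binary exact-match reward below is an assumption, and the names `label_reward` and `ppo_clipped_term` are hypothetical. The second function is the standard PPO clipped surrogate term for a single action, shown only to make the optimization target concrete.

```python
import math


def label_reward(generated_text: str, gold_label: str, valid_labels: list[str]) -> float:
    """Terminal reward for one episode (assumed design, not the paper's):
    1.0 if the generated answer equals the gold label, otherwise 0.0."""
    answer = generated_text.strip().lower()
    normalized = {label.strip().lower() for label in valid_labels}
    if answer not in normalized:
        # Malformed or out-of-label-set output earns no reward.
        return 0.0
    return 1.0 if answer == gold_label.strip().lower() else 0.0


def ppo_clipped_term(logp_new: float, logp_old: float, advantage: float,
                     clip_eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one action:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r = exp(logp_new - logp_old)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    labels = ["positive", "negative"]
    print(label_reward(" Positive ", "positive", labels))
    print(ppo_clipped_term(logp_new=-0.5, logp_old=-0.7, advantage=1.0))
```

In a full training loop the reward would be assigned at the end of each generated sequence and converted into per-token advantages (for example with a learned value function). That part is omitted here because the summary does not describe it.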

