LG, AI, CL · May 23, 2023

Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks

arXiv:2305.14201v1 · 107 citations
Originality: Highly original
AI Analysis

This paper addresses the poor arithmetic performance of large language models, a problem relevant to both researchers and practitioners, and reports a strong, specific gain rather than incremental progress.

The researchers tackled arithmetic reasoning in language models by fine-tuning LLaMA to create Goat, which outperforms GPT-4 on arithmetic tasks and achieves near-perfect accuracy on large-number addition/subtraction while matching or surpassing PaLM-540B's accuracy with a much smaller model.

We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on the BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained language models, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.
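The abstract does not spell out the decomposition steps, so the following is only a minimal sketch of one plausible reading: it assumes that a multi-digit multiplication is split by the distributive law over the place-value expansion of the smaller operand, and that the partial products are then added one at a time. The function name `decompose_multiplication` and the step format are hypothetical and are not taken from the released dataset script.

```python
def decompose_multiplication(a: int, b: int) -> list[str]:
    """Return illustrative intermediate steps for a * b (non-negative ints).

    Assumption: the hard task (multi-digit * multi-digit) is reduced to
    easier ones (n-digit * 1-digit-times-power-of-ten, then addition).
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")

    # Expand the smaller operand so there are fewer partial products.
    big, small = (a, b) if a >= b else (b, a)

    # Place-value expansion, e.g. 397 -> [300, 90, 7]; zero digits are skipped.
    digits = str(small)
    parts = [
        int(d) * 10 ** (len(digits) - i - 1)
        for i, d in enumerate(digits)
        if d != "0"
    ]
    if not parts:  # small == 0
        return [f"{a} * {b} = 0"]

    # Step 1: rewrite the product using the expansion.
    steps = [f"{a} * {b} = {big} * ({' + '.join(map(str, parts))})"]
    # Step 2: distribute the multiplication over the sum.
    steps.append(" + ".join(f"{big} * {p}" for p in parts))
    # Step 3: evaluate each partial product.
    products = [big * p for p in parts]
    steps.append(" + ".join(map(str, products)))

    # Step 4: add the partial products left to right, one addition per step.
    total, rest = products[0], products[1:]
    while rest:
        total += rest.pop(0)
        steps.append(" + ".join(map(str, [total] + rest)))
    return steps


if __name__ == "__main__":
    for line in decompose_multiplication(397, 4429):
        print(line)
```

For `397 * 4429`, the sketch rewrites the product as `4429 * (300 + 90 + 7)`, evaluates the three partial products, and sums them to `1758313`. Each intermediate line involves only a multiplication by a single nonzero digit scaled by a power of ten, or an addition, which are the kinds of sub-task the abstract describes as learnable.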

Code Implementations: 1 repo