LG · Jan 30, 2025

Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH

arXiv:2501.18576v1 · 13 citations · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of selecting a language model for mathematical reasoning tasks. It is incremental: it compares existing models on a specific dataset and introduces no new methods.

This study tackled the challenge of solving 30 difficult MATH dataset problems by removing time constraints and testing DeepSeek R1 against four other models, finding that DeepSeek R1 achieved superior accuracy but generated significantly more tokens, highlighting a trade-off between accuracy and efficiency.

This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems derived from the MATH dataset, problems that previously proved unsolvable by other models under time constraints. Unlike prior work, this research removes time limitations to explore whether DeepSeek R1's architecture, known for its reliance on token-based reasoning, can achieve accurate solutions through a multi-step process. The study compares DeepSeek R1 with four other models (gemini-1.5-flash-8b, gpt-4o-mini-2024-07-18, llama3.1:8b, and mistral-8b-latest) across 11 temperature settings. Results demonstrate that DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models, confirming its token-intensive approach. The findings highlight a trade-off between accuracy and efficiency in mathematical problem-solving with large language models: while DeepSeek R1 excels in accuracy, its reliance on extensive token generation may not be optimal for applications requiring rapid responses. The study underscores the importance of considering task-specific requirements when selecting an LLM and emphasizes the role of temperature settings in optimizing performance.
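The accuracy-versus-token comparison described above can be sketched as a simple aggregation over per-run records. This is only an illustrative sketch, not the study's code: the `Run` record, the `summarize` helper, the evenly spaced temperature grid from 0.0 to 1.0, and the example records are all assumptions introduced here, and the numbers are placeholders rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One model answer to one problem at one temperature (hypothetical schema)."""
    model: str
    temperature: float
    problem_id: int
    correct: bool
    tokens: int


def summarize(runs):
    """Return {model: (accuracy, mean_tokens)} over all runs of that model."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.model].append(run)
    summary = {}
    for model, items in grouped.items():
        accuracy = sum(r.correct for r in items) / len(items)
        mean_tokens = sum(r.tokens for r in items) / len(items)
        summary[model] = (accuracy, mean_tokens)
    return summary


# Assumption: the 11 temperature settings are evenly spaced from 0.0 to 1.0.
TEMPERATURES = [round(0.1 * i, 1) for i in range(11)]

# Placeholder records invented for illustration only (not the paper's data).
example_runs = [
    Run("model-a", 0.0, 1, True, 4000),
    Run("model-a", 0.5, 1, True, 6000),
    Run("model-b", 0.0, 1, False, 300),
    Run("model-b", 0.5, 1, True, 500),
]
example_summary = summarize(example_runs)
```

Reporting accuracy and mean token count side by side per model makes the trade-off explicit: a model can rank first on one axis and last on the other, which is the pattern the abstract attributes to DeepSeek R1.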

