CL | Apr 10, 2025

DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger

arXiv:2504.08120v3 | 12.08 citations | h-index: 7 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the problem of evaluating natural language generation and provides the first systematic comparison of reasoning LLMs in this domain. It is an incremental advance because it builds on existing models and benchmarks.

This study compared reasoning and non-reasoning large language models as evaluators of machine translation (MT) and text summarization. OpenAI o3-mini improved with more reasoning on MT tasks, DeepSeek-R1 generally underperformed its non-reasoning variant except in summarization consistency, and distillation maintained performance down to 32B parameters but degraded at the 8B scale.

Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks. We evaluate eight models spanning state-of-the-art reasoning models (DeepSeek-R1, OpenAI o3), their distilled variants (8B-70B parameters), and equivalent non-reasoning LLMs. Experiments on WMT23 and SummEval benchmarks reveal architecture- and task-dependent benefits: OpenAI o3-mini models show improved performance with increased reasoning on MT, while DeepSeek-R1 generally underperforms compared to its non-reasoning variant except in summarization consistency evaluation. Correlation analysis demonstrates that reasoning token usage correlates with evaluation quality only in specific models, while almost all models generally allocate more reasoning tokens when identifying more quality issues. Distillation maintains reasonable performance up to 32B parameter models but degrades substantially at 8B scale. This work provides the first assessment of reasoning LLMs for NLG evaluation and comparison to non-reasoning models. We share our code to facilitate further research: https://github.com/NL2G/reasoning-eval.
