Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis
This work addresses translation performance gaps that matter to AI researchers and practitioners, though its contribution is incremental because it builds on existing LLM evaluations.
This study evaluated o1-Like LLMs in multilingual machine translation, finding that they set new benchmarks, with DeepSeek-R1 outperforming GPT-4o in contextless tasks, but also identifying issues such as rambling in Chinese-centric outputs and high resource costs.
o1-Like LLMs are transforming AI by simulating human cognitive processes, but their performance in multilingual machine translation (MMT) remains underexplored. This study examines (1) how o1-Like LLMs perform in MMT tasks and (2) what factors influence their translation quality. We evaluate multiple o1-Like LLMs and compare them with traditional models such as ChatGPT and GPT-4o. Results show that o1-Like LLMs establish new multilingual translation benchmarks, with DeepSeek-R1 surpassing GPT-4o in contextless tasks. They demonstrate strengths in historical and cultural translation but tend to ramble in Chinese-centric outputs. Further analysis reveals three key insights: (1) High inference costs and slower processing speeds make complex translation tasks more resource-intensive. (2) Translation quality improves with model size, which enhances commonsense reasoning and cultural translation. (3) The temperature parameter significantly impacts output quality: lower temperatures yield more stable and accurate translations, while higher temperatures reduce coherence and precision.
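
To make the temperature finding concrete, the following is a minimal sketch of how a sampling temperature typically reshapes a model's next-token distribution. It is not taken from the study: the function name `softmax_with_temperature` and the three logit values are hypothetical, chosen only for illustration, and real decoders may apply further steps such as top-p filtering.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by a temperature.

    Hypothetical helper for illustration; not the study's code.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens (assumed values).
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharper distribution
high = softmax_with_temperature(logits, 1.5)  # flatter distribution

print("T=0.2:", [round(p, 3) for p in low])
print("T=1.5:", [round(p, 3) for p in high])
```

At the lower temperature almost all probability mass falls on the top-scoring token, so sampling is close to deterministic. At the higher temperature the mass spreads across the alternatives, which is one plausible reason sampled translations become less stable.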