Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering
This work addresses a gap in the evaluation of large language models for scientific text categorization. The contribution is incremental: it applies existing models to a new domain using a custom dataset.
This study compares the performance of GPT-4o and DeepSeek R1 at categorizing sentences from scientific papers using prompt engineering. Because DeepSeek R1's effectiveness on this specific task was previously unexplored, the study introduces a new evaluation method and a dataset for the analysis.
This study examines how large language models categorize sentences from scientific papers using prompt engineering. We use two advanced web-based models, GPT-4o (by OpenAI) and DeepSeek R1, to classify sentences into predefined relationship categories. DeepSeek R1 has been tested on benchmark datasets in its technical report. However, its performance in scientific text categorization remains unexplored. To address this gap, we introduce a new evaluation method designed specifically for this task. We also compile a dataset of cleaned scientific papers from diverse domains. This dataset provides a platform for comparing the two models. Using this dataset, we analyze their effectiveness and consistency in categorization.
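To make the setup concrete, the following is a minimal sketch of prompt-based sentence categorization and of two simple consistency measures. It is an illustration under stated assumptions, not the evaluation method introduced in this study: the category names, the prompt wording, and the helper functions (`build_prompt`, `parse_label`, `agreement_rate`, `self_consistency`) are all hypothetical, and no model API is called.

```python
from collections import Counter

# Hypothetical label set; the abstract does not list the actual categories.
CATEGORIES = ["cause-effect", "comparison", "definition", "other"]


def build_prompt(sentence, categories=CATEGORIES):
    """Build a single-label classification prompt (wording is an assumption)."""
    options = ", ".join(categories)
    return (
        "Classify the following sentence from a scientific paper into exactly "
        f"one of these categories: {options}.\n"
        f"Sentence: {sentence}\n"
        "Answer with the category name only."
    )


def parse_label(response, categories=CATEGORIES, fallback="other"):
    """Map a free-text model reply to the first category name it contains."""
    text = response.strip().lower()
    for category in categories:
        if category in text:
            return category
    return fallback


def agreement_rate(labels_a, labels_b):
    """Fraction of sentences on which two models assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def self_consistency(runs):
    """Fraction of repeated runs on one sentence that match the majority label."""
    if not runs:
        return 0.0
    majority_count = Counter(runs).most_common(1)[0][1]
    return majority_count / len(runs)
```

In use, the string returned by `build_prompt` would be sent to each model, the replies normalized with `parse_label`, and the resulting label lists compared with `agreement_rate` (between models) or `self_consistency` (across repeated runs of one model on the same sentence).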