CL · AI · Dec 5, 2023

GPT vs Human for Scientific Reviews: A Dual Source Review on Applications of ChatGPT in Science

arXiv:2312.03769v1 · 7 citations · h-index: 142
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of evaluating LLM-assisted scientific reviews for researchers, but it is incremental, as it builds on existing comparisons of AI and human performance.

The study compared GPT-based and human scientific reviews across 13 papers, finding that 50% of SciSpace's responses to objective questions aligned with those of the human reviewer, with GPT-4 often rating the human reviewer higher in accuracy but SciSpace higher in structure, clarity, and completeness.

The new polymath Large Language Models (LLMs) can greatly speed up scientific reviews, possibly using more unbiased quantitative metrics, facilitating cross-disciplinary connections, and identifying emerging trends and research gaps by analyzing large volumes of data. However, at present, they lack the required deep understanding of complex methodologies, have difficulty evaluating innovative claims, and are unable to assess ethical issues and conflicts of interest. Herein, we consider 13 GPT-related papers across different scientific domains, reviewed by a human reviewer and SciSpace, a large language model, with the reviews evaluated by three distinct types of evaluators, namely GPT-3.5, a crowd panel, and GPT-4. We found that 50% of SciSpace's responses to objective questions align with those of a human reviewer, with GPT-4 (informed evaluator) often rating the human reviewer higher in accuracy, and SciSpace higher in structure, clarity, and completeness. On subjective questions, the uninformed evaluators (GPT-3.5 and crowd panel) showed varying preferences between SciSpace and human responses, with the crowd panel showing a preference for the human responses. However, GPT-4 rated them equally in accuracy and structure but favored SciSpace for completeness.

