CL CY | Feb 1, 2025

Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset

Man Luo, Bradley Peterson, Rafael Gan, Hari Ramalingame, Navya Gangrade, Ariadne Dimarogona, Imon Banerjee, Phillip Howard

arXiv:2502.01676v1 | 2.7 | h-index: 7 | Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses toxic feedback in peer reviews, aiming to foster a healthier academic environment. Its contribution is incremental, however, since it mainly benchmarks existing methods on a new dataset.

This work tackles the detection of toxic feedback in peer reviews. The authors curate a new annotated dataset from OpenReview and benchmark a range of models, including LLMs. They show that detailed instructions improve alignment with human judgments, with GPT-4 reaching a Cohen's Kappa of up to 0.63.

Peer review is crucial for advancing and improving science through constructive criticism. However, toxic feedback can discourage authors and hinder scientific progress. This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform, annotated by human experts according to these definitions. Leveraging this dataset, we benchmark a variety of models, including a dedicated toxicity detection model, a sentiment analysis model, several open-source large language models (LLMs), and two closed-source LLMs. Our experiments explore the impact of different prompt granularities, from coarse to fine-grained instructions, on model performance. Notably, state-of-the-art LLMs like GPT-4 exhibit low alignment with human judgments under simple prompts but achieve improved alignment with detailed instructions. Moreover, the model's confidence score is a good indicator of better alignment with human judgments. For example, GPT-4 achieves a Cohen's Kappa score of 0.56 with human judgments, which increases to 0.63 when using only predictions with a confidence score higher than 95%. Overall, our dataset and benchmarks underscore the need for continued research to enhance toxicity detection capabilities of LLMs. By addressing this issue, our work aims to contribute to a healthy and responsible environment for constructive academic discourse and scientific collaboration.
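The confidence-filtered agreement described in the abstract can be illustrated with a minimal sketch. It computes Cohen's Kappa between human and model labels, then recomputes it on only those predictions whose confidence exceeds a threshold. The function names (`cohen_kappa`, `kappa_above_confidence`) and the toy labels and confidence values are assumptions made up for illustration. They are not taken from the paper's code or dataset, and the paper's actual evaluation procedure may differ.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_above_confidence(human, model, confidence, threshold):
    """Kappa restricted to predictions with confidence above `threshold`.

    Returns the kappa value and the number of predictions kept.
    """
    kept = [(h, m) for h, m, c in zip(human, model, confidence) if c > threshold]
    if not kept:
        raise ValueError("no predictions exceed the confidence threshold")
    kept_human, kept_model = zip(*kept)
    return cohen_kappa(kept_human, kept_model), len(kept)


# Hypothetical toy data (1 = toxic, 0 = non-toxic); not from the paper.
human = [1, 0, 1, 0, 1, 0, 0, 1]
model = [1, 0, 0, 0, 1, 1, 0, 1]
confidence = [0.99, 0.97, 0.60, 0.98, 0.96, 0.70, 0.99, 0.97]

kappa_all = cohen_kappa(human, model)
kappa_confident, n_kept = kappa_above_confidence(human, model, confidence, 0.95)
print(f"kappa on all predictions: {kappa_all:.2f}")
print(f"kappa on {n_kept} predictions with confidence > 0.95: {kappa_confident:.2f}")
```

In this toy example the two disagreements happen to be the low-confidence predictions, so filtering raises the agreement. Note that filtering also shrinks the evaluated set, so a higher kappa comes at the cost of coverage.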
