Outcome-Constrained Large Language Models for Countering Hate Speech
This work addresses the challenge of making counterspeech against online hate speech more impactful, though the contribution appears incremental because it builds on existing LLM techniques.
The study tackled the problem of generating counterspeech that achieves specific conversation outcomes, such as low conversation incivility and non-hateful hater reentry. It experimented with large language models using instruction prompts, finetuning, and reinforcement learning, and found that these methods effectively steer generation toward the desired outcomes, although quality and style vary across models.
Automatic counterspeech generation methods have been developed to assist efforts in combating hate speech. Existing research focuses on generating counterspeech with linguistic attributes such as being polite, informative, and intent-driven. However, the real impact of counterspeech in online environments is seldom considered. This study aims to develop methods for generating counterspeech constrained by conversation outcomes and to evaluate their effectiveness. We experiment with large language models (LLMs) to incorporate two desired conversation outcomes into the text generation process: low conversation incivility and non-hateful hater reentry. Specifically, we experiment with instruction prompts, LLM finetuning, and LLM reinforcement learning (RL). Evaluation results show that our methods effectively steer the generation of counterspeech toward the desired outcomes. Our analyses, however, show that the quality and style of the generated counterspeech differ depending on the model.
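To make the idea of outcome-constrained generation concrete, the following is a minimal sketch and not the authors' implementation. It assumes two hypothetical outcome scorers, each returning a probability in [0, 1] that a reply leads to low conversation incivility or to non-hateful hater reentry. The names `OutcomeScorers`, `build_prompt`, `outcome_reward`, and `pick_best`, the prompt wording, and the equal weighting of the two outcomes are all illustrative assumptions. A combined score of this kind could plausibly serve as an RL reward or as a criterion for reranking sampled candidates.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical scorer type: maps (hate_comment, reply) to a probability in [0, 1].
# In practice this would wrap a trained outcome classifier; here it is a placeholder.
OutcomeScorer = Callable[[str, str], float]


@dataclass(frozen=True)
class OutcomeScorers:
    """Bundle of the two (assumed) conversation-outcome predictors."""
    low_incivility: OutcomeScorer
    non_hateful_reentry: OutcomeScorer


def build_prompt(hate_comment: str) -> str:
    """Instruction prompt that states the desired outcomes (wording is an assumption)."""
    return (
        "Write a reply to the comment below. The reply should keep the "
        "following conversation civil and make it likely that the author "
        "returns without posting hateful content.\n\n"
        f"Comment: {hate_comment}\nReply:"
    )


def outcome_reward(
    hate_comment: str,
    reply: str,
    scorers: OutcomeScorers,
    weight: float = 0.5,
) -> float:
    """Weighted average of the two predicted outcome probabilities.

    The weighting scheme is an illustrative choice, not taken from the paper.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    p_civil = scorers.low_incivility(hate_comment, reply)
    p_reentry = scorers.non_hateful_reentry(hate_comment, reply)
    return weight * p_civil + (1.0 - weight) * p_reentry


def pick_best(
    hate_comment: str,
    candidates: Sequence[str],
    scorers: OutcomeScorers,
) -> str:
    """Return the candidate reply with the highest combined outcome score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda r: outcome_reward(hate_comment, r, scorers))
```

In this sketch, `build_prompt` corresponds to the instruction-prompt setting, while `outcome_reward` illustrates the kind of scalar signal that finetuning data selection or RL could optimize. How the outcomes are actually predicted and combined is left open here.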