The Moral Gap of Large Language Models
This work addresses the challenge of building ethically aligned AI systems, which is relevant to researchers and developers, but it is incremental: it confirms existing knowledge that task-specific fine-tuning outperforms prompting.
The study tackles moral foundation detection in social discourse by comparing state-of-the-art large language models (LLMs) with fine-tuned transformers on Twitter and Reddit datasets. It reveals substantial performance gaps: the LLMs show high false negative rates and systematically under-detect moral content.
Moral foundation detection is crucial for analyzing social discourse and developing ethically aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear. This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets, using receiver operating characteristic (ROC), precision-recall (PR), and detection error tradeoff (DET) curve analysis. Results reveal substantial performance gaps, with LLMs exhibiting high false negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
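
To make the error metrics named above concrete, the following is a minimal sketch of how a false negative rate and the points of a DET curve (false negative rate against false positive rate) could be computed for one binary moral foundation label. It is an illustration only, not the study's evaluation code: the function names, the toy labels and scores, and the 0.5 threshold are all assumptions.

```python
from typing import Dict, List, Tuple


def rates(labels: List[int], scores: List[float], threshold: float) -> Dict[str, float]:
    """Error rates for one binary label at a given decision threshold.

    labels: 1 if the text carries the moral foundation, else 0 (hypothetical data).
    scores: model confidence that the foundation is present.
    """
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1  # moral content that the model missed
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return {
        # Share of truly moral texts that were not detected.
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
        # Share of non-moral texts that were flagged as moral.
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def det_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """(FPR, FNR) pairs, one per distinct score used as a threshold."""
    points = []
    for t in sorted(set(scores)):
        r = rates(labels, scores, t)
        points.append((r["fpr"], r["fnr"]))
    return points


if __name__ == "__main__":
    # Toy example: four moral texts, two non-moral texts (made-up values).
    labels = [1, 1, 1, 0, 0, 1]
    scores = [0.9, 0.4, 0.2, 0.1, 0.6, 0.8]
    print(rates(labels, scores, threshold=0.5))
    print(det_points(labels, scores))
```

In this toy data, two of the four moral texts fall below the threshold, which is the kind of under-detection that a high false negative rate describes. Sweeping the threshold, as `det_points` does, shows how that rate trades off against false positives.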