Assessing how hyperparameters impact Large Language Models' sarcasm detection performance
This study addresses the challenge of sarcasm detection for NLP researchers and practitioners, but its contribution is incremental: it focuses on hyperparameter tuning and model evaluation without introducing new methods.
This work investigates how hyperparameters and model characteristics affect sarcasm detection performance in large language models such as GPT and Llama-2. It finds that fine-tuned Llama-2-13b achieves state-of-the-art accuracy and F1-score, both 0.83 and comparable to human performance, while zero-shot GPT-4 reaches a competitive accuracy of 0.70 and F1-score of 0.75.
Sarcasm detection is challenging for both humans and machines. This work explores how model characteristics impact sarcasm detection in OpenAI's GPT and Meta's Llama-2 models, chosen for their strong natural language understanding and popularity. We evaluate fine-tuned and zero-shot models across various sizes, releases, and hyperparameters. Experiments were conducted on the political and balanced (pol-bal) portion of the popular Self-Annotated Reddit Corpus (SARC2.0) sarcasm dataset. Fine-tuned performance improves monotonically with model size within a model family, and hyperparameter tuning also affects performance. In the fine-tuning scenario, full-precision Llama-2-13b achieves state-of-the-art accuracy and $F_1$-score, both measured at 0.83, comparable to average human performance. In the zero-shot setting, one GPT-4 model achieves performance competitive with prior attempts, yielding an accuracy of 0.70 and an $F_1$-score of 0.75. Furthermore, a model's performance may increase or decline with each release, highlighting the need to reassess performance after each release.
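For readers less familiar with the two reported metrics, the following is a minimal sketch of how accuracy and $F_1$ can be computed for a binary sarcasm classifier. It assumes that label 1 means "sarcastic" and that $F_1$ is computed on that positive class; the averaging scheme used in the study is not stated in this section. The function name `accuracy_and_f1` and the toy labels are hypothetical and are not drawn from SARC2.0 or from the study's results.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels.

    Assumption: 1 = sarcastic (positive class), 0 = not sarcastic.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Confusion-matrix counts for the positive (sarcastic) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)

    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Hypothetical, balanced toy data (4 sarcastic, 4 not), invented for illustration.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 1, 0, 0]  # over-predicts the sarcastic class

toy_accuracy, toy_f1 = accuracy_and_f1(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.2f}, F1={toy_f1:.2f}")
```

On this balanced toy set, a classifier that over-predicts the sarcastic class scores an accuracy of 0.75 and an $F_1$ of 0.80. This only illustrates how the two metrics can diverge on balanced data; it is not an explanation of any reported result.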