Jan 31, 2025

Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media

arXiv:2502.01658v1
6 citations
h-index: 3
J Med Internet Res
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for efficient tobacco control research by reducing labor-intensive human analysis, though it is incremental as it applies existing LLMs to a new domain-specific dataset.

The study tackled the problem of automating sentiment analysis for heated tobacco products on social media by evaluating the accuracy of LLMs like GPT-3.5 and GPT-4 Turbo in replicating human expert evaluations, with GPT-4 Turbo achieving around 80% accuracy on Facebook and Twitter messages.

Sentiment analysis of alternative tobacco products on social media is important for tobacco control research. Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process. This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs). The research used GPT-3.5 and GPT-4 Turbo to classify 500 Facebook and 500 Twitter messages, including anti-HTPs, pro-HTPs, and neutral messages. The models evaluated each message up to 20 times, and the majority label across those responses was compared with the label assigned by human evaluators. Results showed that GPT-3.5 accurately replicated human sentiment 61.2% of the time for Facebook messages and 57.0% for Twitter messages. GPT-4 Turbo performed better, with 81.7% accuracy for Facebook and 77.0% for Twitter. Using only three response instances, GPT-4 Turbo reached 99% of the accuracy it achieved with twenty instances. GPT-4 Turbo also had higher accuracy for anti- and pro-HTPs messages than for neutral ones. Misclassifications by GPT-3.5 often involved anti- or pro-HTPs messages being labeled as neutral or irrelevant, while GPT-4 Turbo showed improvements across all categories. In conclusion, LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared with human experts. However, because accuracy differs across sentiment categories, there is a risk of misrepresenting the overall sentiment.
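The evaluation procedure described above (repeated model responses per message, a majority label, then comparison with human labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the label strings, the toy data, and the tie-breaking rule (first label to reach the top count wins) are all assumptions, since the summary does not specify them.

```python
from collections import Counter


def majority_label(responses):
    """Return the most frequent label among repeated model responses.

    Ties go to the label that appears first in the list. This tie rule
    is an assumption; the source does not say how ties were handled.
    """
    counts = Counter(responses)
    best = max(counts.values())
    for label in responses:
        if counts[label] == best:
            return label


def accuracy(model_labels, human_labels):
    """Fraction of messages where the model label matches the human label."""
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


def accuracy_by_category(model_labels, human_labels):
    """Accuracy computed separately for each human-assigned category."""
    totals, hits = Counter(), Counter()
    for m, h in zip(model_labels, human_labels):
        totals[h] += 1
        hits[h] += (m == h)
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Hypothetical toy data: three model responses for each of four messages.
responses_per_message = [
    ["anti", "anti", "neutral"],
    ["pro", "neutral", "pro"],
    ["neutral", "anti", "anti"],
    ["neutral", "neutral", "pro"],
]
human = ["anti", "pro", "neutral", "neutral"]

model = [majority_label(r) for r in responses_per_message]
print(accuracy(model, human))              # overall agreement with humans
print(accuracy_by_category(model, human))  # agreement per sentiment category
```

Reporting accuracy per category alongside the overall figure makes the caveat in the conclusion visible: if neutral messages are misclassified more often than anti- or pro-HTPs messages, the aggregate sentiment distribution produced by the model can be skewed even when overall accuracy looks acceptable.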

