CL SE Sep 11, 2024

How Effectively Do LLMs Extract Feature-Sentiment Pairs from App Reviews?

Faiz Ali Shah, Ahmed Sabir, Rajesh Sharma, Dietmar Pfahl

arXiv:2409.07162v32.76 citations h-index: 11 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient sentiment analysis in app development to align with user feedback, but it is incremental as it primarily benchmarks existing LLMs on a specific task.

The study tackled the problem of automatically analyzing user reviews to extract app features and associated sentiments, comparing state-of-the-art LLMs like GPT-4 against previous methods in zero-shot and few-shot scenarios. Results showed GPT-4 outperformed rule-based SAFE by 17% in F1-score for feature extraction in zero-shot, with further improvements in few-shot settings, but was exceeded by fine-tuned RE-BERT by 6%.

Automatic analysis of user reviews to understand user sentiments toward app functionality (i.e., app features) helps align development efforts with user expectations and needs. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters, i.e., using zero or a few labeled examples, but the capabilities of LLMs remain unexplored for feature-specific sentiment analysis. The goal of our study is to explore the capabilities of LLMs to perform feature-specific sentiment analysis of user reviews. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and different variants of Llama-2 chat, against previous approaches for extracting app features and associated sentiments in zero-shot, 1-shot, and 5-shot scenarios. The results indicate that GPT-4 outperforms the rule-based SAFE by 17% in F1-score for extracting app features in the zero-shot scenario, with the 5-shot setting improving it by a further 6%. However, the fine-tuned RE-BERT exceeds GPT-4 by 6% in F1-score. For predicting positive and neutral sentiments, GPT-4 achieves F1-scores of 76% and 45% in the zero-shot setting, which improve by 7% and 23% in the 5-shot setting, respectively. Our study conducts a thorough evaluation of both proprietary and open-source LLMs to provide an objective assessment of their performance in extracting feature-sentiment pairs.
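To make the evaluation setup in the abstract concrete, the following is a minimal sketch of how a zero-shot or few-shot prompt might be assembled and how an F1-score over extracted feature-sentiment pairs could be computed. It is not the authors' code: the prompt wording, the function names `build_prompt` and `pair_f1`, and the exact-match criterion are all assumptions made for illustration, and the paper may use a different matching scheme.

```python
def build_prompt(review, examples=()):
    """Assemble a prompt; zero examples gives a zero-shot prompt.

    The instruction wording is hypothetical, not taken from the paper.
    """
    lines = ["Extract (app feature, sentiment) pairs from the review."]
    for ex_review, ex_pairs in examples:  # 1-shot or 5-shot demonstrations
        lines.append(f"Review: {ex_review}")
        lines.append(f"Pairs: {ex_pairs}")
    lines.append(f"Review: {review}")
    lines.append("Pairs:")
    return "\n".join(lines)


def pair_f1(predicted, gold):
    """Precision, recall, and F1 under exact matching (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up data: one of two predicted pairs matches one of three gold pairs.
predicted = [("dark mode", "positive"), ("sync", "negative")]
gold = [("dark mode", "positive"), ("sync", "neutral"), ("export", "negative")]
precision, recall, f1 = pair_f1(predicted, gold)
```

Passing one or five `(review, pairs)` demonstrations to `build_prompt` corresponds to the 1-shot and 5-shot scenarios. Scoring only the feature strings instead of full pairs would give a feature-extraction F1 of the kind the abstract reports separately from sentiment prediction.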
