To Ensemble or Not: Assessing Majority Voting Strategies for Phishing Detection with Large Language Models
This paper addresses the problem of improving phishing detection for cybersecurity applications, but its contribution is incremental, since it builds on existing ensemble methods.
This study investigated three majority voting strategies for phishing URL detection using Large Language Models, finding that ensembles are effective when components have equivalent performance but may not surpass the best single model or prompt if there are significant performance discrepancies.
The effectiveness of Large Language Models (LLMs) depends heavily on the quality of the prompts they receive. However, even when processing identical prompts, LLMs can yield varying outcomes due to differences in their training processes. To leverage the collective intelligence of multiple LLMs and enhance their performance, this study investigates three majority voting strategies for text classification, focusing on phishing URL detection. The strategies are: (1) a prompt-based ensemble, which applies majority voting to the responses a single LLM generates for various prompts; (2) a model-based ensemble, which aggregates the responses of multiple LLMs to a single prompt; and (3) a hybrid ensemble, which combines the two methods by sending different prompts to multiple LLMs and then aggregating their responses. Our analysis shows that ensemble strategies are best suited to cases where the individual components perform at equivalent levels. However, when individual performance differs significantly, the ensemble may not outperform the highest-performing single LLM or prompt. In such instances, ensemble techniques are not recommended.
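The three voting strategies described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Classifier` type, the function names, the label strings, and the stub models are all hypothetical, and real LLM calls are replaced by plain functions that take a prompt and a URL and return a label. The hybrid variant here queries every model with every prompt, which is one possible reading of the description. Ties are broken in favor of the label seen first, which is also an assumption.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: a classifier maps (prompt, url) to a label string.
Classifier = Callable[[str, str], str]


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one vote")
    return Counter(labels).most_common(1)[0][0]


def prompt_ensemble(model: Classifier, prompts: Sequence[str], url: str) -> str:
    """One model, several prompts: vote over the model's responses."""
    return majority_vote([model(p, url) for p in prompts])


def model_ensemble(models: Sequence[Classifier], prompt: str, url: str) -> str:
    """Several models, one prompt: vote over the models' responses."""
    return majority_vote([m(prompt, url) for m in models])


def hybrid_ensemble(
    models: Sequence[Classifier], prompts: Sequence[str], url: str
) -> str:
    """Several models, several prompts: vote over every (model, prompt) pair.

    Assumption: the full cross product is used; the source does not say
    how prompts are assigned to models.
    """
    return majority_vote([m(p, url) for m in models for p in prompts])


# Stub classifiers standing in for real LLM calls (illustration only).
def always_phishing(prompt: str, url: str) -> str:
    return "phishing"


def always_benign(prompt: str, url: str) -> str:
    return "benign"


def keyword_model(prompt: str, url: str) -> str:
    return "phishing" if "login" in url else "benign"
```

With these stubs, `model_ensemble([always_phishing, always_benign, keyword_model], "Is this URL phishing?", "http://example.com/login")` collects two "phishing" votes against one "benign" vote. The sketch also makes the abstract's caveat easy to see: a weak component such as `always_phishing` contributes a vote with the same weight as a strong one, so an ensemble of unevenly performing components can be pulled below its best member.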