CR · AI · HC · Jul 23, 2025

MeAJOR Corpus: A Multi-Source Dataset for Phishing Email Detection

arXiv:2507.17978v2 · 1 citation · h-index: 3 · Has Code
Originality: Synthesis-oriented
AI Analysis

The corpus gives cybersecurity researchers a reusable dataset for improving phishing detection, though the contribution is incremental because it builds on existing data sources.

The paper tackles the problem of limited training data for phishing email detection by introducing the MeAJOR Corpus, a multi-source dataset with 135,894 samples, and demonstrates its effectiveness by achieving a 98.34% F1 score with XGBoost.

Phishing emails continue to pose a significant threat to cybersecurity by exploiting human vulnerabilities through deceptive content and malicious payloads. While Machine Learning (ML) models are effective at detecting phishing threats, their performance largely relies on the quality and diversity of the training data. This paper presents the MeAJOR (Merged email Assets from Joint Open-source Repositories) Corpus, a novel, multi-source phishing email dataset designed to overcome critical limitations in existing resources. It integrates 135,894 samples representing a broad range of phishing tactics and legitimate emails, with a wide spectrum of engineered features. We evaluated the dataset's utility for phishing detection research through systematic experiments with four classification models (RF, XGB, MLP, and CNN) across multiple feature configurations. Results highlight the dataset's effectiveness, achieving 98.34% F1 with XGB. By integrating broad features from multiple categories, our dataset provides a reusable and consistent resource, while addressing common challenges like class imbalance, generalisability and reproducibility.
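
For readers unfamiliar with the headline metric, the sketch below shows how a binary F1 score is computed from predictions. It is a minimal illustration, not the paper's evaluation code: the function name `f1_score`, the toy labels, and the choice of phishing as the positive class are all assumptions, and the abstract does not say which averaging scheme (binary, macro, or weighted) lies behind the 98.34% figure.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # phishing caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # phishing missed
    if tp == 0:
        # With no true positives, precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = phishing, 0 = legitimate); illustrative only, not corpus data.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 ignores true negatives, it is less easily inflated than accuracy when one class dominates, which is relevant to the class-imbalance concern the abstract mentions.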

