Classification of Spam Emails through Hierarchical Clustering and Supervised Learning
This work addresses the need for more detailed spam handling for email users and administrators, though it is incremental as it applies existing methods to a new multi-class dataset.
The paper tackles spam email classification by moving beyond binary detection to multi-class categorization, achieving a micro F1 score of 95.39% with TF-IDF and SVM and classifying an email in 2.13 ms with TF-IDF and Naïve Bayes.
Spammers exploit the popularity of email to send unsolicited messages indiscriminately. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them with new strategies such as word obfuscation or image-based spam. For the first time in the literature, we propose classifying spam emails into categories to improve the handling of already detected spam, instead of relying only on a binary model. First, we applied a hierarchical clustering algorithm to create SPEMC-$11$K (SPam EMail Classification), the first multi-class spam dataset, which contains three types of spam emails: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-$11$K to evaluate combinations of TF-IDF and BOW encodings with Naïve Bayes (NB), Decision Tree, and SVM classifiers. Finally, for the task of multi-class spam classification, we recommend (i) TF-IDF combined with SVM for the best micro F1 score, $95.39\%$, and (ii) TF-IDF combined with NB for the fastest classification, analyzing an email in $2.13$ ms.
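To make the reported metric concrete, the following is a minimal sketch of how a micro-averaged F1 score can be computed for a three-class spam problem. It is not the authors' evaluation code: the function name `micro_f1`, the short class labels, and the toy predictions are all assumptions made for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correctly assigned to this class
            elif pred == label and truth != label:
                fp += 1  # wrongly assigned to this class
            elif pred != label and truth == label:
                fn += 1  # member of this class that was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels standing in for the three categories named in the abstract.
y_true = ["health_tech", "scam", "sexual", "scam", "health_tech", "sexual"]
y_pred = ["health_tech", "scam", "scam", "scam", "health_tech", "sexual"]

score = micro_f1(y_true, y_pred)
print(f"micro F1 = {score:.4f}")
```

In this toy run five of the six emails are labeled correctly. Because each email carries exactly one label, every misclassification counts once as a false positive and once as a false negative, so the micro F1 score coincides with plain accuracy in the single-label setting.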