CL LG

May 18, 2020

Classification of Spam Emails through Hierarchical Clustering and Supervised Learning

Francisco Jáñez-Martino, Eduardo Fidalgo, Santiago González-Martínez, Javier Velasco-Mata

arXiv:2005.08773v2

0.839 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for more detailed spam handling for email users and administrators, though it is incremental as it applies existing methods to a new multi-class dataset.

The paper tackles spam email classification by moving beyond binary detection to multi-class categorization. It reports a micro F1 score of 95.39% with TF-IDF and SVM, and a classification time of 2.13 ms per email with TF-IDF and Naïve Bayes.

Spammers take advantage of the popularity of email to send unsolicited messages indiscriminately. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them through new strategies, such as word obfuscation or image-based spam. For the first time in the literature, we propose classifying spam email into categories to improve the handling of already detected spam, instead of just using a binary model. First, we applied a hierarchical clustering algorithm to create SPEMC-11K (SPam EMail Classification), the first multi-class dataset, which contains three types of spam emails: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-11K to evaluate the combination of TF-IDF and BOW encodings with Naïve Bayes (NB), Decision Trees and SVM classifiers. Finally, for the task of multi-class spam classification, we recommend (i) TF-IDF combined with SVM for the best micro F1 score, 95.39%, and (ii) TF-IDF along with NB for the fastest spam classification, analyzing an email in 2.13 ms.
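The TF-IDF plus Naïve Bayes combination and the micro F1 metric named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy emails, the label strings, and the names `tfidf_fit`, `tfidf_transform`, `MultinomialNB` and `micro_f1` are all hypothetical. The smoothed IDF formula and the treatment of TF-IDF weights as fractional counts in a multinomial Naïve Bayes model are assumptions chosen for simplicity.

```python
import math
from collections import Counter, defaultdict


def tfidf_fit(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing: ln((1 + n) / (1 + df)) + 1 keeps every weight positive.
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_transform(doc, idf):
    """Map a tokenized document to a sparse {term: tf * idf} vector."""
    tf = Counter(t for t in doc if t in idf)  # out-of-vocabulary terms are dropped
    return {t: c * idf[t] for t, c in tf.items()}


class MultinomialNB:
    """Multinomial Naive Bayes over sparse weighted vectors (Laplace smoothing)."""

    def fit(self, vectors, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        vocab = sorted({t for v in vectors for t in v})
        counts = Counter(labels)
        self.log_prior = {c: math.log(counts[c] / len(labels)) for c in self.classes}
        # Accumulate the TF-IDF mass of each term per class.
        weight = {c: defaultdict(float) for c in self.classes}
        for v, y in zip(vectors, labels):
            for t, w in v.items():
                weight[y][t] += w
        self.log_like = {}
        for c in self.classes:
            total = sum(weight[c].values()) + alpha * len(vocab)
            self.log_like[c] = {
                t: math.log((weight[c][t] + alpha) / total) for t in vocab
            }
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t]
                for t, w in vector.items()
                if t in self.log_like[c]
            )
        return max(self.classes, key=score)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multi-class predictions."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # Each error is one false positive (predicted class) and one false negative (true class).
    fp = fn = len(y_true) - tp
    return 2 * tp / (2 * tp + fp + fn) if y_true else 0.0


# Invented toy emails, one label per spam category named in the abstract.
train = [
    ("cheap pills pharmacy discount pills", "health_tech"),
    ("software license discount download", "health_tech"),
    ("inheritance transfer bank account urgent", "personal_scam"),
    ("lottery winner claim prize bank", "personal_scam"),
    ("adult dating singles chat", "sexual_content"),
    ("hot singles adult video", "sexual_content"),
]
test = [
    ("discount pharmacy pills", "health_tech"),
    ("claim your lottery prize", "personal_scam"),
    ("adult chat", "sexual_content"),
]

train_docs = [text.split() for text, _ in train]
idf = tfidf_fit(train_docs)
model = MultinomialNB().fit(
    [tfidf_transform(d, idf) for d in train_docs], [y for _, y in train]
)
predictions = [model.predict(tfidf_transform(text.split(), idf)) for text, _ in test]
score = micro_f1([y for _, y in test], predictions)
```

When every email carries exactly one label, as in this sketch, micro F1 coincides with plain accuracy, because each misclassification adds one false positive and one false negative to the pooled counts.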
