CL | LG | Dec 4, 2025

Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering

arXiv:2512.04396v1 | 1 citation | Int J Data Sci
Originality: Synthesis-oriented
AI Analysis

This paper provides a lightweight, interpretable baseline for sarcasm detection in online discussions, though the contribution is incremental because it relies on existing methods and data.

The paper tackled sarcasm detection on Reddit using classical machine learning and feature engineering, achieving F1-scores around 0.57 for sarcastic comments with methods like logistic regression and Naive Bayes.
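The F1-score quoted above is computed for the sarcastic class specifically, as the harmonic mean of that class's precision and recall. A minimal sketch of that calculation follows. The function name `f1_for_class` and the toy labels are invented for illustration and are not taken from the paper.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """Return (precision, recall, F1), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only (1 = sarcastic, 0 = not sarcastic).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = f1_for_class(y_true, y_pred, positive=1)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting F1 for the sarcastic class alone, rather than overall accuracy, shows how well the model finds sarcastic comments without rewarding it for correct non-sarcastic predictions.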

Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform best, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and interpretable methods.
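The feature side of the abstract (word-level TF-IDF, character-level TF-IDF, and stylistic indicators combined into one representation) can be sketched without any dependencies. The abstract does not give the tokenisation, the n-gram range, or which stylistic indicators were used, so every such choice below is an assumption made for illustration: the regex tokeniser, character 2- to 4-grams, the smoothed IDF formula, and the four indicators. The three example comments are invented. The paper itself may well have used a library vectoriser instead.

```python
import math
import re
from collections import Counter


def word_tokens(text):
    # Assumed tokeniser: lowercase runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def char_ngrams(text, n_min=2, n_max=4):
    # Assumed range: all character 2- to 4-grams of the lowercased text.
    s = text.lower()
    return [s[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(s) - n + 1)]


def fit_idf(docs_terms):
    """Smoothed inverse document frequency for every term seen in the corpus."""
    n_docs = len(docs_terms)
    df = Counter(term for terms in docs_terms for term in set(terms))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(terms, idf):
    """L2-normalised TF-IDF weights as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in terms if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def stylistic(text):
    # Hypothetical indicators; the paper's actual list is not given in the abstract.
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "n_chars": len(text),
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        "upper_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
    }


def featurize(text, word_idf, char_idf):
    """Concatenate the three feature groups under prefixed keys."""
    feats = {f"w:{t}": w for t, w in tfidf(word_tokens(text), word_idf).items()}
    feats.update({f"c:{t}": w for t, w in tfidf(char_ngrams(text), char_idf).items()})
    feats.update({f"s:{k}": float(v) for k, v in stylistic(text).items()})
    return feats


# Invented example comments, not drawn from SARC 2.0.
comments = [
    "Oh great, another Monday!",
    "I really enjoyed this movie.",
    "Yeah, SURE, that will work...",
]
word_idf = fit_idf([word_tokens(c) for c in comments])
char_idf = fit_idf([char_ngrams(c) for c in comments])
features = featurize(comments[2], word_idf, char_idf)
```

The resulting sparse dictionary is the kind of input a linear classifier such as logistic regression or multinomial Naive Bayes could consume. The key prefixes (`w:`, `c:`, `s:`) keep the three groups separate, so individual weights can still be traced back to a word, a character n-gram, or a stylistic cue.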

