CL, AI, LG · May 3, 2024

Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset

arXiv:2405.02175v3 · 22 citations · h-index: 6 · Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Originality: Synthesis-oriented
AI Analysis

This work addresses disinformation in reference resources such as Wikipedia, a concern for both researchers and everyday users. Its contribution is incremental, however, since it builds on existing datasets and methods.

The authors tackled the problem of detecting hoax articles on Wikipedia by creating Hoaxpedia, a dataset of 311 hoax articles paired with legitimate ones, and found that automated detection based on content is challenging but possible, with edit histories yielding better classification results than text alone.

Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists) together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs. the article's definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in the distributions of edit histories, and find that looking at this feature yields better classification results than content alone.
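
To make the experimental variables in the abstract concrete, here is a minimal Python sketch of how a binary dataset could be assembled at a chosen hoax-to-legit ratio, using either the full article or its definition alone. It is not the authors' pipeline. The function names (`first_sentence`, `build_dataset`), the `legit_per_hoax` parameter, and the "first sentence equals definition" heuristic are all assumptions made for illustration, and legitimate articles are sampled at random here rather than matched by semantic similarity as in the paper.

```python
import random


def first_sentence(text):
    """Crude stand-in for an article's definition: text up to the first '. '."""
    head, sep, _ = text.partition(". ")
    return head + "." if sep else text


def build_dataset(hoaxes, legits, legit_per_hoax, definition_only=False, seed=0):
    """Return shuffled (text, label) pairs; label 1 = hoax, 0 = legitimate.

    Assumption: legitimate articles are drawn at random. The paper instead
    pairs each hoax with semantically similar legitimate articles.
    """
    rng = random.Random(seed)
    # Cap the sample so we never ask for more legit articles than exist.
    n_legit = min(len(legits), legit_per_hoax * len(hoaxes))
    sampled = rng.sample(legits, n_legit)
    view = first_sentence if definition_only else (lambda t: t)
    rows = [(view(t), 1) for t in hoaxes] + [(view(t), 0) for t in sampled]
    rng.shuffle(rows)
    return rows


if __name__ == "__main__":
    # Toy placeholder texts, not real Hoaxpedia entries.
    hoaxes = ["Fake A is a town. It has a river.", "Fake B is a poet. He wrote odes."]
    legits = [f"Real {i} is a place. It has a history." for i in range(5)]
    for text, label in build_dataset(hoaxes, legits, legit_per_hoax=2, definition_only=True):
        print(label, text)
```

Raising `legit_per_hoax` makes the hoax class rarer, which is one way to approximate the imbalanced setting the abstract refers to as varying hoax-to-legit ratios.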

Code Implementations: 1 repo