CRCL Feb 19, 2020

Attacking Neural Text Detectors

arXiv:2002.11768v4 (57 citations)
AI Analysis

This work addresses the risk of misinformation spread by AI language models by exposing vulnerabilities in existing detection methods. The contribution is incremental, since it builds on prior attack strategies.

The paper probes the robustness of AI-generated-text detection by proposing two black-box attacks, one based on homoglyphs and one on intentional misspellings, that evade neural text detectors. The homoglyph attack reduces one detector's recall from 97.44% to as low as 0.26%.

Machine learning based language models have recently made significant progress, which introduces the danger that they will be used to spread misinformation. To combat this potential danger, several methods have been proposed for detecting text written by these language models. This paper presents two classes of black-box attacks on these detectors: one randomly replaces characters with homoglyphs, and the other is a simple scheme to purposefully misspell words. The homoglyph and misspelling attacks decrease a popular neural text detector's recall on neural text from 97.44% to 0.26% and 22.68%, respectively. Results also indicate that the attacks are transferable to other neural text detectors.
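
The two attack classes described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation. The homoglyph table (Latin letters mapped to visually similar Cyrillic ones), the misspelling list, the `rate` parameter, and the function names `homoglyph_attack` and `misspelling_attack` are all assumptions made for illustration, since this summary does not specify the character set, the replacement rate, or the misspelling scheme the authors used.

```python
import random

# Assumed, illustrative Latin -> Cyrillic look-alike table (not the paper's).
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "x": "\u0445",  # CYRILLIC SMALL LETTER HA
    "y": "\u0443",  # CYRILLIC SMALL LETTER U
}

# Assumed, illustrative list of common misspellings (not the paper's).
MISSPELLINGS = {
    "definitely": "definately",
    "separate": "seperate",
    "receive": "recieve",
    "occurred": "occured",
}


def homoglyph_attack(text, rate=0.1, seed=None):
    """Replace each mappable character with its look-alike with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)


def misspelling_attack(text):
    """Swap space-separated words that exactly match the table for a misspelling."""
    return " ".join(MISSPELLINGS.get(word, word) for word in text.split(" "))
```

Both functions are black-box in the sense the abstract uses: they only rewrite the input text and need no access to the detector's weights or gradients. The homoglyph output looks nearly identical to a human reader, but its code points differ from the original, so a detector's tokenizer sees a different character sequence.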

Code Implementations: 1 repo
