CL · Apr 10, 2024

Simpler becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora?

Miriam Anschütz, Edoardo Mosca, Georg Groh

arXiv:2404.06838v1 · h-index: 11 · Has Code · DETERMIT

Originality: Incremental advance

AI Analysis

The study investigated whether pre-trained classifiers keep their predictions coherent between original and simplified text inputs. It found alarming inconsistencies across all tested languages and models: simplified inputs enable zero-iteration adversarial attacks with success rates of up to 50%.

This highlights a critical vulnerability in pre-trained NLP models. Simplified text can be exploited to craft adversarial attacks, which puts both reliability and security at risk.

Text simplification seeks to improve readability while retaining the original content and meaning. Our study investigates whether pre-trained classifiers also maintain such coherence by comparing their predictions on both original and simplified inputs. We conduct experiments using 11 pre-trained models, including BERT and OpenAI's GPT-3.5, across six datasets spanning three languages. Additionally, we conduct a detailed analysis of the correlation between prediction change rates and simplification types/strengths. Our findings reveal alarming inconsistencies across all languages and models. If this is not promptly addressed, simplified inputs can easily be exploited to craft zero-iteration, model-agnostic adversarial attacks with success rates of up to 50%.
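The two quantities the abstract reports, the prediction change rate and the attack success rate, can be illustrated with a minimal sketch. It assumes predictions are aligned pairwise (original input i paired with its simplified version i). It also assumes two definitions: the change rate is the fraction of pairs whose predicted label differs, and an attack "succeeds" when simplification flips a prediction that was correct on the original input. The function names and both definitions are illustrative assumptions, not the authors' code or their exact metrics.

```python
from typing import Sequence


def prediction_change_rate(original_preds: Sequence[str],
                           simplified_preds: Sequence[str]) -> float:
    """Assumed metric: share of aligned pairs whose predicted label changes."""
    if len(original_preds) != len(simplified_preds):
        raise ValueError("predictions must be aligned pairwise")
    if not original_preds:
        return 0.0
    changed = sum(o != s for o, s in zip(original_preds, simplified_preds))
    return changed / len(original_preds)


def attack_success_rate(gold: Sequence[str],
                        original_preds: Sequence[str],
                        simplified_preds: Sequence[str]) -> float:
    """Assumed metric: among inputs classified correctly in their original form,
    the share whose prediction changes once the input is simplified
    (a 'zero-iteration' attack: the simplified text is used as-is, no search)."""
    if not (len(gold) == len(original_preds) == len(simplified_preds)):
        raise ValueError("labels and predictions must be aligned pairwise")
    # Only originally correct predictions can be "attacked".
    attackable = [(o, s) for g, o, s in zip(gold, original_preds, simplified_preds)
                  if o == g]
    if not attackable:
        return 0.0
    flipped = sum(o != s for o, s in attackable)
    return flipped / len(attackable)


if __name__ == "__main__":
    # Toy, made-up labels purely for illustration.
    gold = ["pos", "neg", "neg", "neu"]
    original = ["pos", "neg", "pos", "neu"]
    simplified = ["pos", "pos", "neg", "neu"]
    print(prediction_change_rate(original, simplified))     # 0.5
    print(attack_success_rate(gold, original, simplified))  # 0.333...
```

Restricting the success rate to originally correct predictions keeps it from counting label changes that merely fix an existing error. Under a different assumed definition, such as counting every change, the sketch would reduce to the change rate itself.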
