An Evaluation of Explanation Methods for Black-Box Detectors of Machine-Generated Text
This work addresses the need for interpretability in machine-generated text detection, which is crucial for users in fields such as content moderation and education. The contribution is incremental in that it builds on existing explanation methods.
The study evaluated explanation methods for black-box detectors of machine-generated text, finding that SHAP performed best in faithfulness and stability, while LIME was perceived as most useful by users but scored worst in helping them predict detector behavior.
The increasing difficulty of distinguishing language-model-generated from human-written text has led to the development of detectors of machine-generated text (MGT). However, in many contexts a black-box prediction is not sufficient; it is equally important to know on what grounds a detector made that prediction. Explanation methods that estimate feature importance promise to indicate which parts of an input a classifier uses for its prediction. However, these methods are typically evaluated with simple classifiers and on tasks that are intuitive to humans. To assess their suitability beyond these contexts, this study conducts the first systematic evaluation of explanation quality for detectors of MGT. The dimensions of faithfulness and stability are evaluated with five automated experiments, and usefulness is assessed in a user study. We use a dataset of ChatGPT-generated and human-written documents, and pair the predictions of three existing language-model-based detectors with the corresponding SHAP, LIME, and Anchor explanations. We find that SHAP performs best in terms of faithfulness, stability, and helping users predict the detector's behavior. In contrast, LIME, although perceived as most useful by users, scores worst in terms of user performance at predicting detector behavior.
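To make the notion of faithfulness concrete, the following is a minimal sketch of a deletion-style check: if an explanation is faithful, removing the tokens it ranks highest should change the detector's output more than removing tokens it ranks lowest. Everything here is an illustrative assumption rather than the study's setup: `toy_detector` and `MARKER_WEIGHTS` are hypothetical stand-ins for a real detector, and the leave-one-out `occlusion_importance` stands in for SHAP, LIME, or Anchor. The abstract does not specify the five automated experiments, so this should not be read as a reproduction of any of them.

```python
import math

# Hypothetical toy detector: NOT one of the detectors evaluated in the study.
# It scores a text by the presence of a few made-up "marker" words.
MARKER_WEIGHTS = {"delve": 2.0, "furthermore": 1.5, "tapestry": 1.0}


def toy_detector(tokens):
    """Return a pseudo-probability that the token list is machine-generated."""
    score = sum(MARKER_WEIGHTS.get(t.lower(), 0.0) for t in tokens) - 1.0
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_importance(tokens, predict):
    """Leave-one-out importance: drop in prediction when a token is removed.

    A simple stand-in for a feature-importance explanation method.
    """
    base = predict(tokens)
    return [base - predict(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def deletion_drop(tokens, importances, predict, k):
    """Drop in prediction after deleting the k highest-ranked tokens."""
    ranked = sorted(range(len(tokens)), key=lambda i: importances[i], reverse=True)
    removed = set(ranked[:k])
    reduced = [t for i, t in enumerate(tokens) if i not in removed]
    return predict(tokens) - predict(reduced)


tokens = "Furthermore we delve into the rich tapestry of results".split()
importances = occlusion_importance(tokens, toy_detector)

# Deleting the two tokens ranked most important by the explanation.
drop_top = deletion_drop(tokens, importances, toy_detector, k=2)
# Baseline: deleting the two tokens ranked least important (negated scores).
drop_bottom = deletion_drop(tokens, [-v for v in importances], toy_detector, k=2)

print(f"drop after deleting top-2 tokens:    {drop_top:.3f}")
print(f"drop after deleting bottom-2 tokens: {drop_bottom:.3f}")
```

A large gap between the two drops suggests that the ranking tracks what the detector actually relies on. With a real detector, `predict` would wrap the model's scoring function and `importances` would come from the explanation method under evaluation.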