"I know myself better, but not really greatly": How Well Can LLMs Detect and Explain LLM-Generated Texts?
This work addresses the problem of distinguishing human- from LLM-generated texts to mitigate misuse risks, but its contribution is incremental because it builds on existing detection methods.
The paper investigates how well large language models (LLMs) can detect and explain LLM-generated texts. It finds that self-detection outperforms cross-detection, that both remain suboptimal, and that introducing a ternary classification framework improves detection accuracy and explanation quality.
Distinguishing between human- and LLM-generated texts is crucial given the risks associated with misuse of LLMs. This paper investigates the detection and explanation capabilities of current LLMs across two settings: binary classification (human vs. LLM-generated) and ternary classification (adding an "undecided" class). We evaluate six closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs), though both remain suboptimal. Introducing a ternary classification framework improves both detection accuracy and explanation quality across all models. Through comprehensive quantitative and qualitative analyses using our human-annotated dataset, we identify key explanation failures, primarily reliance on inaccurate features, hallucinations, and flawed reasoning. Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
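As an illustration of the self- versus cross-detection comparison described above, the following is a minimal sketch and not the paper's evaluation code. It assumes each record covers one LLM-generated text and names the detecting model, the generating model, a gold label, and the predicted label. The record layout, the function name `accuracy_by_setting`, the model names, and the rule that "undecided" is scored like any other label are all assumptions made for illustration.

```python
from collections import defaultdict

# Label set for the ternary setting; the binary setting would omit "undecided".
LABELS = {"human", "llm", "undecided"}


def accuracy_by_setting(records):
    """Return accuracy for the 'self' and 'cross' detection settings.

    Each record is a dict with hypothetical keys:
      detector  - name of the model making the prediction
      generator - name of the model that produced the text
      gold      - reference label, one of LABELS
      pred      - predicted label, one of LABELS
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        if r["gold"] not in LABELS or r["pred"] not in LABELS:
            raise ValueError(f"unknown label in record: {r}")
        # Self-detection: the detector judges text it generated itself.
        setting = "self" if r["detector"] == r["generator"] else "cross"
        total[setting] += 1
        correct[setting] += int(r["pred"] == r["gold"])
    return {s: correct[s] / total[s] for s in total}


if __name__ == "__main__":
    # Toy records with made-up model names, for demonstration only.
    demo = [
        {"detector": "model_a", "generator": "model_a", "gold": "llm", "pred": "llm"},
        {"detector": "model_a", "generator": "model_b", "gold": "llm", "pred": "undecided"},
    ]
    print(accuracy_by_setting(demo))
```

Grouping by whether the detector and generator match keeps the two settings separable in one pass; how an "undecided" prediction should be scored depends on the annotation scheme, which this sketch does not attempt to reproduce.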