CL · AI · Feb 25, 2025

KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis

Shinwoo Park, Shubin Kim, Do-Kyung Kim, Yo-Sub Han

arXiv:2503.00032v510.96 citations · h-index: 5 · Has Code · ACL

Originality: Incremental advance

AI Analysis

KatFishNet addresses the need for detection tools specialized for non-English languages such as Korean, in support of academic integrity and plagiarism prevention. It represents an incremental advance over English-focused methods.

The paper tackles the detection of LLM-generated Korean text by analyzing linguistic features such as spacing patterns, part-of-speech diversity, and comma usage. The resulting method, KatFishNet, achieves on average 19.78% higher AUROC than the best-performing existing detection method.

The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis.
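The abstract names spacing patterns and comma usage as discriminating signals and reports results in AUROC. The following is a minimal sketch of how such surface features and the metric might be computed. It is not the authors' implementation: the function names `surface_features` and `auroc`, the specific feature definitions, and the sentence-splitting rule are all assumptions made for illustration. Part-of-speech diversity is omitted because it would require a Korean morphological analyzer.

```python
import re


def surface_features(text: str) -> dict:
    """Hypothetical surface features; the paper's exact definitions may differ."""
    tokens = text.split()  # whitespace-delimited tokens
    n_chars = sum(1 for c in text if not c.isspace())
    # Rough sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_gaps = max(len(tokens) - 1, 0)  # number of gaps between tokens
    return {
        # How often the writer inserts a space, per non-space character.
        "gaps_per_char": n_gaps / n_chars if n_chars else 0.0,
        # Average token length in characters (punctuation included).
        "mean_token_len": n_chars / len(tokens) if tokens else 0.0,
        # Comma frequency normalized by sentence count.
        "commas_per_sentence": text.count(",") / len(sentences) if sentences else 0.0,
    }


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half. Labels are 1 (positive) or 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

In use, each text would be mapped to a feature dictionary, a classifier would be trained on those features, and its scores would be passed to `auroc` together with the human/LLM labels. The pairwise AUROC shown here is quadratic in the number of examples, which is acceptable for a sketch but not for large benchmarks.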
