Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era
This paper addresses linguistic homogenization in academic writing, a question relevant to researchers and linguists, and reports incremental findings on language-specific trends.
The study investigates whether the shift to large language models (LLMs) is homogenizing research writing by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras. It finds a consistent decline in NLI performance over time, with anomalies in the most recent era: Chinese and French show resistance to the decline, while Japanese and Korean decline more sharply.
The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.
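As an illustration of the kind of per-era comparison described above, the following is a minimal sketch of how per-language NLI recall could be tabulated across the three eras from classifier predictions. The record layout, the function names, and the labels `lang_a` and `lang_b` are assumptions made for this example, and the toy records are made-up placeholders, not data or results from the study.

```python
from collections import defaultdict

# Era names follow the abstract; the record layout is an assumption.
ERAS = ("pre-NN", "pre-LLM", "post-LLM")


def recall_by_era_and_language(records):
    """Compute per-(era, language) recall.

    records: iterable of (era, true_language, predicted_language) tuples.
    Returns a dict mapping (era, true_language) to the fraction of papers
    with that true label that the classifier identified correctly.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for era, gold, pred in records:
        total[(era, gold)] += 1
        correct[(era, gold)] += int(gold == pred)
    return {key: correct[key] / total[key] for key in total}


def era_trend(scores, language):
    """Return one language's recall in chronological era order.

    Eras with no papers for that language yield None.
    """
    return [scores.get((era, language)) for era in ERAS]


# Made-up placeholder records, used only to exercise the functions.
toy_records = [
    ("pre-NN", "lang_a", "lang_a"),
    ("pre-NN", "lang_a", "lang_a"),
    ("pre-LLM", "lang_a", "lang_a"),
    ("pre-LLM", "lang_a", "lang_b"),
    ("post-LLM", "lang_a", "lang_b"),
    ("post-LLM", "lang_a", "lang_b"),
]

scores = recall_by_era_and_language(toy_records)
print(era_trend(scores, "lang_a"))
```

Keying the scores on the true label gives per-language recall, which is one plausible way to ask whether a particular native-language signal is becoming harder to detect; a metric such as macro-F1 could be substituted without changing the overall structure.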