CL · May 5

Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

Humam Khan, Md Tabrez Nafis, Shahab Saquib Sohail, Aqeel Khalique, Rehan Hasan Khan

arXiv:2605.04171 · 84.7 · h-index: 24

Predicted impact: top 57% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers and practitioners using LLMs for academic writing, this work provides a comparative analysis of hallucination tendencies across models and task types, though it is incremental as it applies existing evaluation methods to a new domain.

The study evaluated four LLMs (ChatGPT, Grok, Gemini, Copilot) for hallucinations in academic writing using 80 prompts across four categories. Grok and Copilot performed better on reference generation (HI 0.67, 0.70) but struggled with abstract/stylistic prompts, while Gemini and ChatGPT had better tone control but higher hallucination risk (HI 0.53, 0.57).

Large language models (LLMs) show extraordinary abilities but remain prone to hallucinations, especially when used to generate academic content. We investigate four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations specifically in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score that checks factual accuracy, reference validity, coherence, style consistency, and academic tone. We introduce a novel weighted metric, the Hallucination Index (HI), to measure hallucination in the models' responses. Some of the most widely used evaluation metrics often fail to catch errors that alter sentiment in machine-translated text. We found that Grok and Copilot perform better on reference generation tasks but often struggle with abstract or stylistic prompts, with HI values of 0.67 and 0.70, respectively. Gemini and ChatGPT, by contrast, show stronger tone control but are weaker on factual tasks and carry a higher hallucination risk, with HI scores of 0.53 and 0.57, respectively. Our study found that hallucination behavior depends not only on model architecture but also on the type of task and the prompting conditions. We suggest that this work opens new research directions for future researchers.
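The abstract names five rubric criteria scored from 0 to 5 and a weighted Hallucination Index, but it does not give the formula, the weights, or whether a higher value means more or less hallucination. The sketch below is therefore only an illustration of how a weighted rubric score could be normalized to the range 0 to 1. The weights, the function name `weighted_rubric_index`, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict

# Hypothetical weights: the abstract lists these five criteria,
# but the actual weights used for HI are not stated there.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "factual_accuracy": 0.30,
    "reference_validity": 0.30,
    "coherence": 0.15,
    "style_consistency": 0.15,
    "academic_tone": 0.10,
}

MAX_SCORE = 5  # rubric scores range from 0 to 5 per the abstract


def weighted_rubric_index(
    scores: Dict[str, float],
    weights: Dict[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Return a weighted average of rubric scores, normalized to [0, 1].

    This is an assumed aggregation, not the paper's HI definition.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    for name, value in scores.items():
        if not 0 <= value <= MAX_SCORE:
            raise ValueError(f"{name} score {value} is outside 0-{MAX_SCORE}")

    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / (MAX_SCORE * total_weight)


# Made-up scores for one response, for illustration only.
example_scores = {
    "factual_accuracy": 2,
    "reference_validity": 1,
    "coherence": 4,
    "style_consistency": 4,
    "academic_tone": 5,
}
example_index = weighted_rubric_index(example_scores)
```

Dividing by the maximum score times the total weight keeps the result in the range 0 to 1 even if the weights do not sum to exactly one. Averaging this value over all 80 prompts for a model would give one per-model figure comparable in scale to the HI values quoted above, though the paper's own aggregation may differ.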
