Analyzing the Presentation, Content, and Utilization of References in LLM-powered Conversational AI Systems
This work addresses the problem of ensuring that the references cited by conversational AI systems are reliable and trustworthy for users. The contribution is incremental: it builds on existing concerns and does not introduce new methods.
The study analyzed 1,517 references from 30 question-answer pairs across nine LLM-powered conversational AI systems and found variations in both presentation and quality. For example, ChatGPT provided more references (9.5 per response) of higher quality (15.48/20 CRAAP score) than Hunyuan-TurboS (4.0 references per response, 11.65/20).
As conversational AI systems become popular for information retrieval and question answering, the references they cite are key to ensuring that their answers are reliable and trustworthy. Yet no prior work systematically analyzes how these references are presented or how good they are. We examine 1,517 references from 30 question-answer pairs across nine systems, focusing on (1) their presentation in the user interface and (2) their quality, assessed with the CRAAP criteria (Currency, Relevance, Authority, Accuracy, Purpose). We find notable variations in the presentation, quality, and quantity of references across systems. For instance, ChatGPT provides more references (9.5 per response on average) of higher quality (15.48/20 CRAAP score), while Hunyuan-TurboS provides fewer references (4.0 per response) of lower quality (11.65/20). Additionally, a preliminary user study shows that people rarely interact with these references and that their behavior differs across systems. These findings highlight the need for better interface designs that help users engage with and trust references more effectively.
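The two per-system figures reported above (references per response and mean CRAAP score out of 20) can be illustrated with a minimal sketch. It assumes each of the five CRAAP criteria is rated on a 0 to 4 scale so that the total is out of 20. The rubric, the `Reference` record, and the function names are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from statistics import mean

# The five CRAAP criteria. The 0-4 scale per criterion is an ASSUMPTION,
# chosen only so that the total is out of 20 as in the reported scores.
CRITERIA = ("currency", "relevance", "authority", "accuracy", "purpose")
MAX_PER_CRITERION = 4


@dataclass
class Reference:
    """Hypothetical record for one cited reference."""
    system: str        # name of the conversational AI system
    response_id: int   # which response cited this reference
    scores: dict       # criterion name -> integer rating


def craap_total(ref: Reference) -> int:
    """Sum the five criterion ratings into a score out of 20."""
    for name in CRITERIA:
        value = ref.scores[name]
        if not 0 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name} rating {value} is out of range")
    return sum(ref.scores[name] for name in CRITERIA)


def summarize(refs: list) -> dict:
    """Per system: average references per response and mean CRAAP total.

    Responses that cite no references do not appear in the input, so they
    are not counted in the denominator here.
    """
    by_system = {}
    for ref in refs:
        by_system.setdefault(ref.system, []).append(ref)

    summary = {}
    for system, items in by_system.items():
        n_responses = len({r.response_id for r in items})
        summary[system] = {
            "refs_per_response": len(items) / n_responses,
            "mean_craap": mean(craap_total(r) for r in items),
        }
    return summary
```

Calling `summarize` on a list of rated `Reference` records returns one entry per system, which makes comparisons like the ChatGPT versus Hunyuan-TurboS one straightforward to reproduce once the ratings are available.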