Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?

Lennard C. Froma, Tom Kouwenhoven, Maaike H. T. de Boer, Catholijn M. Jonker, Max J. van Duijn

arXiv:2605.00964 | 61.5 | h-index: 53

AI Analysis

For researchers and practitioners designing human-AI collaborative systems, this work demonstrates that smaller models can achieve comparable user satisfaction and performance gains in RAG-based information-seeking tasks, challenging the assumption that larger models are always necessary.

This study evaluates a RAG-based chatbot assistant in a realistic information-seeking scenario with 112 human participants, finding that human-AI collaboration significantly outperforms model-only baselines regardless of model size (3B, 8B, 70B), while perceived usability and satisfaction show little difference across sizes.

Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has lagged behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, considering usability and satisfaction in addition to accuracy, rather than focusing on benchmark performance alone.
