CL | AI | HC | Dec 19, 2025

ShareChat: A Dataset of Chatbot Conversations in the Wild

arXiv:2512.17843v3 | 3 citations | h-index: 3
Originality: Synthesis-oriented
AI Analysis

ShareChat provides a vital resource for researchers studying real user-LLM interactions, though the contribution is incremental: it focuses on data collection rather than new methods.

The authors address the lack of authentic chatbot conversation data by creating ShareChat, a large-scale dataset of 142,808 conversations collected from publicly shared URLs on five commercial platforms. The data reveal diverse user behaviors and platform-specific features.

While academic research typically treats Large Language Models (LLMs) as generic text generators, they are distinct commercial products with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic chatbot usage. To address this limitation, we present ShareChat, a large-scale corpus of 142,808 conversations (660,293 turns) sourced directly from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat distinguishes itself by preserving native platform affordances, such as citations and thinking traces, across a diverse collection covering 101 languages and the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. To illustrate the dataset's breadth, we present three case studies: a completeness analysis of intent satisfaction, a citation study of model grounding, and a temporal analysis of engagement rhythms. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild. The dataset is publicly available via Hugging Face.

