The DeepSpeak-Agentic Dataset
This dataset and benchmark serve researchers studying AI-generated media detection and human-agent interaction, but the contribution is primarily a new resource rather than a novel method or breakthrough.
The authors present DeepSpeak-Agentic, a video dataset of more than 37 hours of human-AI conversations, and use it to benchmark forensic identification of AI agents in audio, video, and text, while also providing a scalable data-capture system.
We present DeepSpeak-Agentic, a video dataset comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification of AI agents from audio, video, or text, to study the nature of human-agent interactions, and to provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and the agent in the combined stream.