MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
This work addresses the challenge of understanding shifts in human behavior during conversations, with applications in social computing and AI. The contribution is incremental, since it builds on existing multi-modal and language model techniques.
The paper tackles the problem of detecting critical moments, or turning points, in casual conversations by introducing a novel multi-modal dataset with precise annotations and a framework called TPMaven that uses vision-language models and large language models. Results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with explanations that align with human expectations.
Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
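For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The following is a minimal sketch of that computation for a single positive class, not the authors' evaluation code. The function name `f1_score`, the binary label encoding, and the example labels are all hypothetical, and the text does not specify how predictions are matched to annotations or how scores are averaged.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (assumption): 1 = clip contains a turning point, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

In this toy example there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1-score is 0.75.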