Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations
This work addresses the challenge of making multimodal AI systems safer and more reliable in social situations, representing an incremental improvement over chain-of-thought prompting.
The paper aims to improve visual language models' (VLMs') reasoning in social contexts by introducing Cognitive Chain-of-Thought (CoCoT), a three-stage prompting strategy (perception, situation, norm) that yields an average gain of +8% over CoT and direct prompting across multiple multimodal benchmarks.
Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
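To make the contrast between flat CoT and the staged scaffold concrete, here is a minimal sketch of how a three-stage prompt might be assembled. Only the stage names (perception, situation, norm) come from the abstract; the instruction wording, the `Stage` class, and both builder functions are hypothetical illustrations, not the authors' actual prompts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    instruction: str


# Stage names follow the abstract; the instruction text is an assumption.
COCOT_STAGES = (
    Stage("Perception", "Describe what is directly observable in the image."),
    Stage("Situation", "Explain the social context these observations suggest."),
    Stage("Norm", "Identify the social norms that apply and judge accordingly."),
)


def build_cocot_prompt(question: str, stages=COCOT_STAGES) -> str:
    """Assemble a staged prompt: one numbered step per stage, then the question."""
    lines = ["Answer the question about the image by reasoning in stages."]
    for i, stage in enumerate(stages, start=1):
        lines.append(f"Step {i} ({stage.name}): {stage.instruction}")
    lines.append(f"Question: {question}")
    lines.append("Final answer:")
    return "\n".join(lines)


def build_flat_cot_prompt(question: str) -> str:
    """Baseline for comparison: a generic, unstructured step-by-step cue."""
    return f"Question: {question}\nLet's think step by step."


if __name__ == "__main__":
    print(build_cocot_prompt("Is the person's action appropriate here?"))
```

The resulting string would be sent to a VLM together with the image. The point of the sketch is the ordering: the model is asked to commit to observations before interpreting the situation, and to interpret the situation before making a norm-based judgment.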