MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts
The paper highlights a critical safety problem for real-world AI deployment. Its contribution is incremental but important: it shows a conflict between reasoning objectives and emergency responsiveness.
The paper investigates whether large language models optimized for reasoning ignore safety in emergencies. It finds that specialized reasoning models often ignore life-threatening situations while maintaining high task completion rates, and that the time spent reasoning delays any potential help by up to 15 seconds.
Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a "tunnel vision" that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.
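As a rough illustration of the quantities the abstract contrasts, the sketch below computes a task completion rate, an emergency acknowledgment rate, and a maximum latency over a set of labeled responses. The `Response` record, its field names, the `summarize` helper, and the sample labels are all hypothetical; the abstract does not describe how MortalMATH labels or scores responses, so this is one plausible bookkeeping scheme rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical labels, assumed to come from a human or automated judge.
    solved_math: bool          # the reply completed the algebra task
    addressed_emergency: bool  # the reply acknowledged the described danger
    latency_s: float           # seconds until the reply was produced


def summarize(responses):
    """Return completion rate, emergency acknowledgment rate, and max latency."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")
    return {
        "task_completion_rate": sum(r.solved_math for r in responses) / n,
        "emergency_ack_rate": sum(r.addressed_emergency for r in responses) / n,
        "max_latency_s": max(r.latency_s for r in responses),
    }


# Made-up labels for illustration only; these are not results from the paper.
sample = [
    Response(solved_math=True, addressed_emergency=False, latency_s=12.0),
    Response(solved_math=True, addressed_emergency=False, latency_s=9.5),
    Response(solved_math=False, addressed_emergency=True, latency_s=1.0),
    Response(solved_math=True, addressed_emergency=True, latency_s=3.0),
]
stats = summarize(sample)
```

Tracking the two rates separately matters because a single reply can both solve the problem and acknowledge the danger, so a high completion rate alone does not show that the emergency was ignored.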