H-CoT (LLM reasoning / chain-of-thought): superseded, i.e. cited as a baseline and beaten by newer methods. 2 papers critique it and 2 beat it on benchmarks, ranking it #11 of 772 most-superseded. Sub-problem: cluster led by H-CoT. Newer alternatives in the same sub-problem include AE-CoT.


Superseded baseline · #11 of 772 most-superseded

H-CoT

H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking

LLM reasoning / chain-of-thought · first seen Feb 18, 2025

superseded — cited as a baseline and beaten by newer methods

2 papers critique it · 2 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites H-CoT as a baseline.

“it remains ineffective on the latest o3 and o4-Mini”
— Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
“While effective in some cases, such approaches suffer from three major limitations. First, their reliance on fixed templates restricts diversity, making attacks easier to detect or defend against. Second, they lack adaptability to different models and contexts, limiting their robustness. Third, their overall effectiveness is constrained, as static designs fail to fully exploit the dynamic nature of CoT reasoning.”
— Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs

Beaten on benchmarks

Head-to-head results where a newer method reports beating H-CoT. Values are copied from the source paper's tables — verify against the cited paper.

DH-CoT-D12 beats H-CoT · ASR [GPT-5]
0.90 vs 0.54
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [GPT-5.1]
0.92 vs 0.84
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [o3]
0.40 vs 0.20
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [o4-Mini]
0.58 vs 0.40
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [Gemini-2.5-Flash-Thinking]
1.00 vs 0.98
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [Claude-3-7-Sonnet-Thinking]
0.82 vs 0.14
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [Claude-Sonnet-4-Thinking]
0.20 vs 0.08
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
DH-CoT-D12 beats H-CoT · ASR [DeepSeek-R1]
1.00 vs 0.56
Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
AE-CoT beats H-CoT · ASR [o1-mini]
92 vs 54
Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
AE-CoT beats H-CoT · HS [o1-mini]
70.4 vs 60
Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
AE-CoT beats H-CoT · ASR [o3-mini]
88 vs 86
Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
AE-CoT beats H-CoT · HS [o3-mini]
72.0 vs 70.4
Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
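As a reading aid for the head-to-head results above, the minimal sketch below recomputes the absolute ASR margin for each row. The values are transcribed from this page, not re-checked against the papers. Treating AE-CoT's ASR figures (92, 88, ...) as percentages and dividing them by 100 is an assumption, and the HS rows are omitted because their scale is not stated here. `ROWS` and `margins` are hypothetical names chosen for illustration. Each margin is only meaningful within its own paper's evaluation setup.

```python
# Head-to-head ASR values as listed on this page:
# (newer method, target model, newer method's ASR, H-CoT's ASR).
# DH-CoT-D12 values are reported as fractions. AE-CoT values are ASSUMED to be
# percentages and are divided by 100 so every row shares one 0-1 scale.
ROWS = [
    ("DH-CoT-D12", "GPT-5", 0.90, 0.54),
    ("DH-CoT-D12", "GPT-5.1", 0.92, 0.84),
    ("DH-CoT-D12", "o3", 0.40, 0.20),
    ("DH-CoT-D12", "o4-Mini", 0.58, 0.40),
    ("DH-CoT-D12", "Gemini-2.5-Flash-Thinking", 1.00, 0.98),
    ("DH-CoT-D12", "Claude-3-7-Sonnet-Thinking", 0.82, 0.14),
    ("DH-CoT-D12", "Claude-Sonnet-4-Thinking", 0.20, 0.08),
    ("DH-CoT-D12", "DeepSeek-R1", 1.00, 0.56),
    ("AE-CoT", "o1-mini", 92 / 100, 54 / 100),
    ("AE-CoT", "o3-mini", 88 / 100, 86 / 100),
]


def margins(rows):
    """Return (method, model, newer ASR minus H-CoT ASR) per row, rounded to 2 dp."""
    return [(method, model, round(new - old, 2)) for method, model, new, old in rows]


if __name__ == "__main__":
    # Print rows from the largest to the smallest reported margin.
    for method, model, delta in sorted(margins(ROWS), key=lambda r: r[2], reverse=True):
        print(f"{method:<11} {model:<27} {delta:+.2f}")
```

Rounding to two decimals hides floating-point noise such as `0.82 - 0.14` evaluating to `0.6799999999999999`, so the printed margins match the precision of the reported values.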

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.

AE-CoT · Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
May 23, 2026