Superseded baseline · #262 of 772 most-superseded
DPO (Direct Preference Optimization)
LLM reasoning / chain-of-thought
superseded — cited as a baseline and beaten by newer methods
1 paper critiques it · 0 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites DPO (Direct Preference Optimization) as a baseline.
“While alternatives like Direct Preference Optimization (DPO) simplify the process by using static preference data, their efficacy is limited in tasks requiring dynamic interaction.”
— Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 7, 2026