LG, AI, CL · Sep 29, 2025

ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation

Aasheesh Singh, Vishal Vaddina, Dagnachew Birru

arXiv:2509.25100v1 · 2 citations · h-index: 2

Originality: Incremental advance

AI Analysis

This paper addresses efficient knowledge transfer between large language models, a problem of practical interest to AI practitioners. It appears incremental, since it builds on existing distillation and preference-optimization methods.

The paper tackles cross-architecture LLM distillation by formulating it as a preference optimization task and reports consistent improvements over conventional black-box knowledge distillation (KD) baselines across five datasets and multiple student models.

We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.
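To make the odds-ratio idea concrete, here is a minimal sketch of a standard ORPO-style loss for a single preference pair. It assumes the teacher's reasoning trace is the preferred ("chosen") response and a student-generated trace is the "rejected" one, and that both are scored by the student model as per-token log-probabilities. Sequence probability is length-normalized (the mean per-token log-prob), and the weight `lam` is an illustrative value, not one taken from the paper. The `pick_rejected` helper is entirely hypothetical: it only illustrates one way to mix off-policy traces (pre-generated before training) with on-policy traces (sampled from the current student). The paper's actual mixed-policy strategy may differ.

```python
import math
import random
from typing import List, Sequence


def avg_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood log P(y|x): mean per-token log-prob."""
    return sum(token_logprobs) / len(token_logprobs)


def log_odds(logp: float) -> float:
    """log odds(y|x) = log P - log(1 - P), computed from log P.

    Requires logp < 0 (P < 1); otherwise log1p(-1) raises a math domain error.
    """
    return logp - math.log1p(-math.exp(logp))


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def orpo_loss(chosen: Sequence[float], rejected: Sequence[float], lam: float = 0.1) -> float:
    """ORPO-style loss for one pair (assumption: chosen = teacher, rejected = student).

    NLL on the chosen trace plus an odds-ratio term that rewards the student model
    for assigning higher odds to the chosen trace than to the rejected one.
    """
    lp_c, lp_r = avg_logprob(chosen), avg_logprob(rejected)
    nll = -lp_c  # supervised (SFT) term on the teacher trace
    ratio = log_odds(lp_c) - log_odds(lp_r)  # log of the odds ratio
    return nll - lam * log_sigmoid(ratio)


def pick_rejected(offline_pool: List[str], online_samples: List[str],
                  mix: float, rng: random.Random) -> str:
    """HYPOTHETICAL mixed-policy sampler, not taken from the paper.

    With probability `mix`, return a fresh on-policy student trace; otherwise
    return an off-policy trace from a pool generated before training.
    """
    source = online_samples if rng.random() < mix else offline_pool
    return rng.choice(source)


if __name__ == "__main__":
    teacher = [-0.2, -0.1, -0.3]  # per-token log-probs of the teacher trace (illustrative)
    student = [-1.2, -0.9, -1.5]  # per-token log-probs of a student trace (illustrative)
    print(orpo_loss(teacher, student))  # small loss: teacher trace already preferred
    print(orpo_loss(student, teacher))  # larger loss: preference is inverted
    rng = random.Random(0)
    print(pick_rejected(["offline trace"], ["online trace"], mix=0.5, rng=rng))
```

Because the odds-ratio term only asks the model to rank the teacher trace above the student trace, it adds a contrastive signal on top of plain supervised imitation; with `lam=0.0` the loss reduces to ordinary NLL on the teacher trace.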
