AI, LG | Feb 2

Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction

Yucheng Wu, Yuekui Yang, Hongzheng Li, Anan Liu, Jian Xiao, Junjie Zhai, Huan Yu, Shaoping Ma, Leye Wang

arXiv:2602.01775v1 | 2.4 | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the cost of deploying new model architectures in online systems such as Tencent WeChat Channels, reducing both switching costs and performance degradation. It is an incremental advance because it builds on existing knowledge distillation methods.

The paper tackles high model switching costs in large-scale user response prediction systems by proposing CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. In experiments on three public datasets, it achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%.

Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.
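The asymmetric update schedule described in the abstract (students update frequently, teachers infrequently) can be illustrated with a toy sketch. This is not the paper's implementation: the logistic models, the soft-target distillation term, the synthetic stream, and every name and parameter (`co_distill`, `teacher_every`, `lam`, `lr`) are assumptions made for illustration, and the distribution-aware adaptation mechanism is omitted entirely.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    # Logistic model: probability of a positive response.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def sgd_step(w, x, y, peer_p, lr, lam):
    # One SGD step on BCE(y, p) + lam * CE(peer_p, p).
    # For a logistic model the gradient w.r.t. the logit is
    # (p - y) + lam * (p - peer_p).
    p = predict(w, x)
    grad_logit = (p - y) + lam * (p - peer_p)
    return [wi - lr * grad_logit * xi for wi, xi in zip(w, x)]


def co_distill(stream, dim, teacher_every=10, lr=0.1, lam=0.5):
    # Hypothetical schedule: the student updates on every sample,
    # the teacher only on every `teacher_every`-th sample.
    student = [0.0] * dim
    teacher = [0.0] * dim
    updates = {"student": 0, "teacher": 0}
    for step, (x, y) in enumerate(stream, start=1):
        p_teacher = predict(teacher, x)
        p_student = predict(student, x)
        student = sgd_step(student, x, y, p_teacher, lr, lam)
        updates["student"] += 1
        if step % teacher_every == 0:
            teacher = sgd_step(teacher, x, y, p_student, lr, lam)
            updates["teacher"] += 1
    return student, teacher, updates


def make_stream(n, seed=0):
    # Synthetic click-like data drawn from a fixed logistic model (assumed).
    rng = random.Random(seed)
    true_w = [2.0, -1.0]
    samples = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        p = sigmoid(sum(a * b for a, b in zip(true_w, x)))
        samples.append((x, 1.0 if rng.random() < p else 0.0))
    return samples


student, teacher, updates = co_distill(make_stream(1000), dim=2, teacher_every=10)
print(updates)  # {'student': 1000, 'teacher': 100}
```

With `teacher_every=10`, the teacher performs one tenth as many updates as the student, which is the sense in which the schedule is asymmetric. How CrossAdapt actually chooses the update frequencies and the distillation weight is not stated in the abstract.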
