IR · LG · Nov 21, 2022

Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation

Zhen Tian, Ting Bai, Zibin Zhang, Zhiyuan Xu, Kangyi Lin, Ji-Rong Wen, Wayne Xin Zhao

arXiv:2211.11159v28.828 citations · h-index: 70 · Has Code

Originality: Highly original

AI Analysis

This work addresses efficiency and accuracy challenges in industrial CTR prediction systems, offering a practical way to handle high-dimensional sparse data with reduced computational overhead.

The paper tackles the computational cost of learning high-order feature interactions in CTR prediction for web-scale recommender systems. It proposes a Directed Acyclic Graph Factorization Machine (KD-DAGFM) that uses knowledge distillation to transfer knowledge from complex teacher models to lightweight student models. On real-world datasets, the approach achieves approximately lossless performance with less than 21.5% of the FLOPs of the state-of-the-art method.

With the growth of high-dimensional sparse data in web-scale recommender systems, the computational cost of learning high-order feature interactions in the CTR prediction task increases substantially, which limits the use of high-order interaction models in real industrial applications. Some recent knowledge-distillation-based methods transfer knowledge from complex teacher models to shallow student models to accelerate online model inference. However, they suffer from degraded model accuracy during the knowledge distillation process, and it is challenging to balance the efficiency and effectiveness of the shallow student models. To address this problem, we propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation. The proposed lightweight student model DAGFM can learn arbitrary explicit feature interactions from teacher networks, achieving approximately lossless performance, as proved by a dynamic programming algorithm. In addition, an improved general model, KD-DAGFM+, is shown to be effective in distilling both explicit and implicit feature interactions from any complex teacher model. Extensive experiments are conducted on four real-world datasets, including a large-scale industrial dataset from the WeChat platform with billions of feature dimensions. KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments, showing the superiority of DAGFM in handling industrial-scale data in the CTR prediction task. Our implementation code is available at: https://github.com/RUCAIBox/DAGFM.
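The teacher-to-student transfer described in the abstract can be illustrated with a generic distillation objective. The sketch below is not taken from the paper or its repository: it assumes a common formulation in which the student is trained on the click labels (binary cross-entropy) plus a term that pulls its logits toward the teacher's logits (squared error). The function names, the weight `alpha`, and the toy values are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bce(label, logit):
    """Binary cross-entropy of one click label against a predicted logit."""
    p = min(max(sigmoid(logit), 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def distillation_loss(labels, student_logits, teacher_logits, alpha=0.5):
    """Batch mean of BCE(label, student) + alpha * (student - teacher)^2.

    alpha is an assumed trade-off weight, not a value from the paper.
    """
    n = len(labels)
    hard = sum(bce(y, s) for y, s in zip(labels, student_logits)) / n
    soft = sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n
    return hard + alpha * soft

# Toy batch: three impressions with click labels and made-up logits.
labels = [1, 0, 1]
teacher_logits = [2.0, -1.5, 0.5]
student_logits = [1.0, -1.0, 0.0]
loss = distillation_loss(labels, student_logits, teacher_logits)
```

With `alpha=0` the objective reduces to ordinary supervised training of the student; larger values weight agreement with the teacher more heavily. The actual objective used for KD-DAGFM may differ, so the linked repository remains the authoritative reference.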
