IR | LG | Nov 21, 2022

Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation

arXiv:2211.11159v2 | 28 citations | h-index: 70 | Has Code
Originality Highly original
AI Analysis

This paper addresses efficiency and accuracy challenges in industrial CTR prediction systems, offering a practical way to handle high-dimensional sparse data with reduced computational overhead.

The paper tackles the computational cost of learning high-order feature interactions in click-through rate (CTR) prediction for web-scale recommender systems. It proposes KD-DAGFM, a Directed Acyclic Graph Factorization Machine that uses knowledge distillation to transfer knowledge from complex teacher models to a lightweight student model. On real-world datasets, the student achieves approximately lossless performance with less than 21.5% of the FLOPs of the state-of-the-art method.
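The teacher-to-student transfer described above can be illustrated with a generic distillation objective. The sketch below is an assumption, not the loss used in the paper: it combines the usual CTR log loss on the click labels with a mean squared error term that pulls the student's logits toward the teacher's. The function names and the mixing weight `alpha` are hypothetical.

```python
import math


def sigmoid(z):
    """Map a logit to a click probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(labels, logits):
    """Mean binary cross-entropy between click labels and predicted logits."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, z in zip(labels, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def distillation_loss(labels, student_logits, teacher_logits, alpha=0.5):
    """Hypothetical objective: alpha * CTR loss + (1 - alpha) * imitation loss.

    The imitation term is the mean squared error between student and
    teacher logits; the paper may weight or define these terms differently.
    """
    n = len(labels)
    imitation = sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n
    return alpha * log_loss(labels, student_logits) + (1 - alpha) * imitation


labels = [1, 0, 1, 0]
teacher = [2.0, -1.5, 0.8, -0.3]
student = [1.6, -1.0, 0.5, 0.1]
loss = distillation_loss(labels, student, teacher, alpha=0.5)
```

When the student reproduces the teacher's logits exactly, the imitation term vanishes and only the weighted CTR loss remains, which is one way to read "approximately lossless" distillation.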

With the growth of high-dimensional sparse data in web-scale recommender systems, the computational cost of learning high-order feature interactions in the CTR prediction task has increased substantially, which limits the use of high-order interaction models in real industrial applications. Some recent knowledge distillation based methods transfer knowledge from complex teacher models to shallow student models to accelerate online model inference. However, they suffer from degraded model accuracy in the knowledge distillation process, and it is challenging to balance the efficiency and effectiveness of the shallow student models. To address this problem, we propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) that learns high-order feature interactions from existing complex interaction models for CTR prediction via knowledge distillation. The proposed lightweight student model, DAGFM, can learn arbitrary explicit feature interactions from teacher networks with approximately lossless performance, a property that is proved by a dynamic programming algorithm. In addition, an improved general model, KD-DAGFM+, is shown to be effective in distilling both explicit and implicit feature interactions from any complex teacher model. Extensive experiments are conducted on four real-world datasets, including a large-scale industrial dataset from the WeChat platform with billions of feature dimensions. KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments, showing the superiority of DAGFM in dealing with industrial-scale data in the CTR prediction task. Our implementation code is available at: https://github.com/RUCAIBox/DAGFM.
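To give some intuition for why a directed acyclic graph combined with dynamic programming can cover explicit high-order interactions, the sketch below works through a simplified scalar case. It is not the paper's DAGFM: it assumes one node per feature, an edge from every lower-indexed feature to every higher-indexed one, and no learnable weights or embeddings. Under those assumptions, propagating products along the edges counts every order-k feature combination exactly once. The function names are hypothetical.

```python
from itertools import combinations
from math import prod


def order_k_interactions_dp(x, k):
    """Sum of products over all k-feature combinations, via DP on a DAG.

    state[i] holds the sum of products over all increasing index chains
    of the current length that end at node i.
    """
    state = list(x)  # chains of length 1: each feature on its own
    for _ in range(k - 1):
        new_state = []
        prefix = 0.0  # sum of state[j] over incoming edges j -> i (j < i)
        for i, xi in enumerate(x):
            new_state.append(prefix * xi)
            prefix += state[i]
        state = new_state
    return sum(state)


def order_k_interactions_brute(x, k):
    """Reference implementation that enumerates every combination."""
    return sum(prod(c) for c in combinations(x, k))


features = [1.0, 2.0, 3.0, 4.0]
second_order = order_k_interactions_dp(features, 2)
third_order = order_k_interactions_dp(features, 3)
```

Each propagation step costs time linear in the number of features, whereas the brute-force enumeration grows combinatorially with the interaction order. Savings of this kind are what make a lightweight student model plausible.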

Code Implementations: 1 repo