IR, AI, LG · Jul 11, 2024

CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data

arXiv:2407.08108v2 · 1 citation · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses training efficiency for e-commerce recommendation systems, offering a practical solution to data compression with minimal accuracy loss, though it is incremental as it builds on existing matrix factorization and sampling techniques.

The paper tackles the exponentially growing volume of training data for deep learning recommendation models. It proposes CADC, which first enriches user and item embeddings with interaction history via matrix factorization and then applies uniform random sampling to the training dataset, achieving a 50% reduction in data size with only a 0.5% drop in accuracy.

Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of data used to train these large models is growing exponentially, leading to substantial training hurdles. The training dataset contains two primary types of information: content-based information (features of users and items) and collaborative information (interactions between users and items). One approach to reducing the training dataset is to remove user-item interactions, but that significantly diminishes the collaborative information, which is crucial for maintaining accuracy because it carries the interaction histories. This loss profoundly impacts DLRM performance. This paper makes an important observation: if one can capture the user-item interaction history to enrich the user and item embeddings, then the interaction history can be compressed without losing model accuracy. Thus, this work, Collaborative Aware Data Compression (CADC), takes a two-step approach to training dataset compression. In the first step, we use matrix factorization of the user-item interaction matrix to create a novel embedding representation for both the users and items. Once the user and item embeddings are enriched with the interaction history information, the second step applies uniform random sampling to the training dataset to drastically reduce its size while minimizing the drop in model accuracy. The source code of CADC is available at https://anonymous.4open.science/r/DSS-RM-8C1D/README.md.
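
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small dense interaction matrix, swaps in a truncated SVD as the matrix factorization, and uses hypothetical helper names (`factorize_interactions`, `uniform_sample`). How the resulting embeddings are fed into a DLRM is not shown.

```python
import numpy as np


def factorize_interactions(interactions, rank):
    """Step 1 (assumed variant): truncated SVD of the user-item matrix.

    Returns (user_emb, item_emb) such that user_emb @ item_emb.T is the
    best rank-`rank` approximation of `interactions`.
    """
    u, s, vt = np.linalg.svd(interactions, full_matrices=False)
    root = np.sqrt(s[:rank])          # split singular values evenly
    user_emb = u[:, :rank] * root     # shape: (n_users, rank)
    item_emb = vt[:rank, :].T * root  # shape: (n_items, rank)
    return user_emb, item_emb


def uniform_sample(records, keep_fraction, rng):
    """Step 2: keep a uniformly random subset of the training records."""
    n_keep = int(round(len(records) * keep_fraction))
    idx = rng.choice(len(records), size=n_keep, replace=False)
    return records[np.sort(idx)]      # preserve original record order


# Toy interaction matrix (illustrative data): 6 users x 5 items, 1 = interacted.
interactions = np.array(
    [
        [1, 0, 1, 0, 0],
        [0, 1, 0, 0, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 1, 1],
        [1, 0, 0, 0, 1],
    ],
    dtype=float,
)

# Embeddings are computed from the FULL interaction history first ...
user_emb, item_emb = factorize_interactions(interactions, rank=3)

# ... and only then is the list of (user, item) training records halved.
records = np.argwhere(interactions > 0)
rng = np.random.default_rng(0)
compressed = uniform_sample(records, keep_fraction=0.5, rng=rng)
```

The ordering is the point of the sketch: the embeddings are derived before sampling, so they still reflect interactions that the compressed record list no longer contains. A production setting would likely need a sparse factorization rather than a dense SVD.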
