DB | AI | LG | Feb 20, 2025

Real-Time Device Reach Forecasting Using HLL and MinHash Data Sketches

arXiv:2502.14785v1 | h-index: 2 | 2020 7th International Conference on Soft Computing & Machine Intelligence (ISCMI)
Originality: Incremental advance
AI Analysis

The system addresses slow device reach prediction for ad businesses, enabling faster customer onboarding and reducing potential revenue loss, though it is an incremental improvement on existing methods.

The paper tackles real-time device reach forecasting for ad targeting by building a system on MinHash and HyperLogLog data sketches. It reports results as accurate as the traditional offline method, within an acceptable error rate of 5%, and a MinHash implementation that runs 4 times faster using SIMD vectorized operations.

Predicting the right number of TVs (device reach) in real time from user-specified targeting attributes is imperative for running a multi-million-dollar ad business. The traditional approach of using SQL queries to join billions of records across multiple targeting dimensions is extremely slow. As a workaround, many applications run an offline process to crunch these numbers and present the results after many hours. In our case, the solution was an offline process that took 24 hours to onboard a customer, resulting in a potential loss of business. To solve this problem, we have built a new real-time prediction system using MinHash and HyperLogLog (HLL) data sketches to compute the device reach at runtime when a user makes a request. However, existing MinHash implementations do not solve the complex problem of multilevel aggregation and intersection. This work shows how we have solved this problem. In addition, we have improved the MinHash algorithm to run 4 times faster using Single Instruction Multiple Data (SIMD) vectorized operations, for high speed and accuracy with constant space when processing billions of records. Finally, through experiments, we show that the results are as accurate as the traditional offline prediction system, with an acceptable error rate of 5%.
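
To make the sketch-based idea concrete, here is a minimal illustration of one common textbook way to combine the two sketches: estimate the union size from merged HLL registers, estimate the Jaccard similarity from MinHash signatures, and multiply the two to approximate the intersection. This is not the paper's implementation. It does not cover the multilevel aggregation or the SIMD optimization the abstract mentions, and every name in it (`hash64`, `HLL`, `MinHash`, `sketch`, `estimate_reach`, the `tv-…` device ids, the segment names) is a hypothetical chosen for illustration.

```python
import hashlib
import math
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus for the MinHash functions


def hash64(item: str) -> int:
    """Deterministic 64-bit hash of a device id (hypothetical helper)."""
    return int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")


class HLL:
    """HyperLogLog with 2**p registers: distinct-count estimates in fixed space."""

    def __init__(self, p: int = 12):
        self.p, self.m = p, 1 << p
        self.reg = [0] * self.m

    def add(self, h: int) -> None:
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def merge(self, other: "HLL") -> "HLL":
        out = HLL(self.p)
        out.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]  # sketch of the union
        return out

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est


class MinHash:
    """k-function MinHash signature; sketches must share k and seed to be comparable."""

    def __init__(self, k: int = 256, seed: int = 1):
        rng = random.Random(seed)
        self.coef = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
        self.sig = [PRIME] * k

    def add(self, h: int) -> None:
        x = h % PRIME
        self.sig = [min(s, (a * x + b) % PRIME) for s, (a, b) in zip(self.sig, self.coef)]

    def jaccard(self, other: "MinHash") -> float:
        same = sum(a == b for a, b in zip(self.sig, other.sig))
        return same / len(self.sig)


def sketch(device_ids):
    """Build both sketches for one targeting segment."""
    hll, mh = HLL(), MinHash()
    for d in device_ids:
        h = hash64(d)
        hll.add(h)
        mh.add(h)
    return hll, mh


def estimate_reach(seg_a, seg_b) -> float:
    """Approximate |A ∩ B| as Jaccard(A, B) * |A ∪ B|."""
    (hll_a, mh_a), (hll_b, mh_b) = seg_a, seg_b
    return mh_a.jaccard(mh_b) * hll_a.merge(hll_b).count()


# Two made-up segments that truly share 3000 devices.
sports = sketch(f"tv-{i}" for i in range(0, 4000))
news = sketch(f"tv-{i}" for i in range(1000, 5000))
reach = estimate_reach(sports, news)
print(round(reach))
```

Each sketch has a fixed size regardless of how many devices it summarizes, so segments can be sketched ahead of time and combined at request time without joining the underlying records. The estimate is approximate, and its error shrinks as the number of HLL registers and MinHash functions grows.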

