IR · Aug 5, 2020

Retrieve Synonymous keywords for Frequent Queries in Sponsored Search in a Data Augmentation Way

arXiv:2008.01969v1 · 2 citations
AI Analysis

This work addresses the semantic gap between queries and keywords and the high precision requirement in industrial sponsored search, and it has yielded revenue gains in Baidu's sponsored search system. The contribution is incremental, since it builds on existing data augmentation and filtering methods.

The paper tackles the retrieval of synonymous keywords for frequent queries in sponsored search to improve targeted advertising. Its Bag-of-Core-Words trick increases the volume of the data increment 4.2 times while keeping the original precision, and its BERT-based discriminant model improves AUC by 11% absolute over a feature-driven GBDT baseline.

In sponsored search, retrieving synonymous keywords is of great importance for accurately targeted advertising. The semantic gap between queries and keywords and the extremely high precision requirement (>= 95%) are the two major challenges of this task. To the best of our knowledge, the problem has not been openly discussed. In an industrial sponsored search system, keywords for frequent queries are usually retrieved ahead of time and stored in a lookup table. Treating these results as a seed dataset, we propose a data-augmentation-like framework to improve synonymous retrieval performance for these frequent queries. The framework comprises two steps: translation-based retrieval and discriminant-based filtering. First, we devise a Trie-based translation model to generate a data increment. In this phase, we apply a Bag-of-Core-Words trick, which increases the volume of the data increment 4.2 times while keeping the original precision. Then we use a BERT-based discriminant model to filter out nonsynonymous pairs; it exceeds the traditional feature-driven GBDT model by 11% absolute AUC. The method has been successfully applied to Baidu's sponsored search system, where it has yielded a significant improvement in revenue. In addition, a commercial Chinese dataset containing 500K synonymous pairs with a precision of 95% is released to the public for paraphrase study (http://ai.baidu.com/broad/subordinate?dataset=paraphrasing).
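
To make the two-step shape of the framework concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation: the abstract does not define the Bag-of-Core-Words trick or the decoding procedure, so the sketch assumes that a keyword's "core words" are its tokens minus a small stop-word list, that order is ignored, and that the trie is used to restrict generated keywords to a known inventory. The names `STOP_WORDS`, `core_bag`, `Trie`, `augment`, and `filter_pairs` are hypothetical, and the `score` callable is a stand-in for the BERT-based discriminant model.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical stop-word list; the paper does not specify how core words are chosen.
STOP_WORDS = {"the", "a", "for", "in", "of", "to"}

Pair = Tuple[str, str]  # (query, keyword)


def core_bag(text: str) -> FrozenSet[str]:
    """Order-insensitive set of non-stop-word tokens (assumed stand-in for core words)."""
    return frozenset(t for t in text.lower().split() if t not in STOP_WORDS)


class Trie:
    """Prefix tree over the token sequences of a keyword inventory."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.is_end = False

    def insert(self, tokens: Iterable[str]) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, Trie())
        node.is_end = True

    def allowed_next(self, prefix: List[str]) -> Set[str]:
        """Tokens that can follow `prefix` without leaving the inventory."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return set()
            node = node.children[tok]
        return set(node.children)


def augment(seed_pairs: List[Pair], inventory: List[str]) -> List[Pair]:
    """Step 1 (assumed form): pair each seed query with inventory keywords
    that share the seed keyword's bag of core words."""
    by_bag: Dict[FrozenSet[str], List[str]] = {}
    for kw in inventory:
        by_bag.setdefault(core_bag(kw), []).append(kw)
    new_pairs: List[Pair] = []
    for query, kw in seed_pairs:
        for cand in by_bag.get(core_bag(kw), []):
            if cand != kw:
                new_pairs.append((query, cand))
    return new_pairs


def filter_pairs(pairs: List[Pair],
                 score: Callable[[str, str], float],
                 threshold: float) -> List[Pair]:
    """Step 2: keep only pairs the discriminant scores at or above the threshold."""
    return [(q, k) for q, k in pairs if score(q, k) >= threshold]


def jaccard(query: str, keyword: str) -> float:
    """Toy score used only for illustration; the paper uses a BERT-based model."""
    a, b = core_bag(query), core_bag(keyword)
    return len(a & b) / len(a | b) if a | b else 0.0


inventory = ["cheap flights to paris", "paris cheap flights", "paris hotels"]
seed = [("low cost paris flights", "cheap flights to paris")]

trie = Trie()
for kw in inventory:
    trie.insert(kw.split())

candidates = augment(seed, inventory)
kept = filter_pairs(candidates, jaccard, threshold=0.3)
```

In this sketch the trie only answers "which token may come next", which is the constraint a translation model's decoder would consult at each step. The threshold would in practice be tuned so that the retained pairs meet the stated precision target.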

