Pairwise Judgment Formulation for Semantic Embedding Model in Web Search
This work addresses a practical issue for search engine developers by improving SEM training efficiency, though the contribution is incremental because it builds on existing data generation strategies.
The paper tackles the problem of constructing effective training data for Semantic Embedding Models (SEMs) from search engine query logs. It finds that conventional Learning-to-Rank approaches are suboptimal for this purpose and proposes a hybrid heuristic that outperforms simpler methods.
Semantic Embedding Models (SEMs) have become a core component in information retrieval and natural language processing due to their ability to model semantic relevance. However, despite their growing use in search engines, few studies have systematically explored how to construct effective training data for SEMs from large-scale search engine query logs. In this paper, we present a comprehensive analysis of strategies for generating pairwise judgments as SEM training data. An interesting (perhaps surprising) finding is that the conventional formulation approaches used in Learning-to-Rank (LTR) are not necessarily optimal for SEM training. Through a large-scale empirical study using query logs and click-through data from a major search engine, we identify effective strategies and demonstrate the advantages of a proposed hybrid heuristic over simpler atomic heuristics. Finally, we provide best practices for SEM training and outline directions for future research.
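
To make the notion of a pairwise judgment concrete, the following is a minimal sketch of how one simple heuristic might turn a single logged result list into (query, preferred document, less-preferred document) triples. The "clicked over non-clicked" rule, the Impression record, and the function name are illustrative assumptions; the abstract does not specify which atomic heuristics the paper evaluates or how the hybrid heuristic combines them.

```python
from itertools import product
from typing import List, NamedTuple, Tuple


class Impression(NamedTuple):
    """One result shown for a query (hypothetical log record)."""
    doc_id: str
    clicked: bool


def clicked_over_non_clicked(
    query: str, results: List[Impression]
) -> List[Tuple[str, str, str]]:
    """Assumed heuristic: every clicked document is preferred over
    every non-clicked document shown for the same query."""
    clicked = [r.doc_id for r in results if r.clicked]
    skipped = [r.doc_id for r in results if not r.clicked]
    # Each triple is (query, preferred doc, less-preferred doc).
    return [(query, pos, neg) for pos, neg in product(clicked, skipped)]


if __name__ == "__main__":
    session = [
        Impression("d1", clicked=False),
        Impression("d2", clicked=True),
        Impression("d3", clicked=False),
    ]
    for triple in clicked_over_non_clicked("example query", session):
        print(triple)
```

Triples of this shape are what a pairwise training objective would typically consume. A session with no clicks, or with every result clicked, yields no pairs under this rule.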