IR | AI | LG | Feb 26

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

arXiv:2602.23234v2 | h-index: 5
Originality: Highly original
AI Analysis

This work presents a method for improving search relevance in large-scale commercial search systems such as app stores, particularly for tail queries, so that users find what they are looking for more efficiently.

This paper addresses the scarcity of expert-provided textual relevance labels in large-scale commercial search systems by using a fine-tuned LLM to generate millions of such labels. Augmenting the production ranker with these labels produced a statistically significant 0.24% increase in conversion rate in a worldwide A/B test on the App Store, with the largest gains on tail queries.

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
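The abstract reports offline NDCG against two label types for the same ranking. As a rough illustration of what that measurement involves, here is a minimal sketch that scores one ranked result list against behavioral labels and against textual labels. The label values, the function names, and the exponential-gain form of DCG are assumptions made for illustration; the abstract does not say which NDCG variant or label scale the authors use.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with exponential gain (an assumed variant)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranked_labels, k=None):
    """NDCG@k for labels listed in the order the ranker returned the results."""
    k = len(ranked_labels) if k is None else k
    ideal = sorted(ranked_labels, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    # A query with no relevant results has no meaningful ideal ranking.
    return dcg(ranked_labels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded labels for one query, in ranked order.
behavioral = [3, 2, 0, 1]  # e.g. derived from clicks or downloads
textual = [2, 3, 1, 0]     # e.g. expert or LLM-generated judgments

ndcg_behavioral = ndcg(behavioral)
ndcg_textual = ndcg(textual)
print(f"behavioral NDCG: {ndcg_behavioral:.3f}, textual NDCG: {ndcg_textual:.3f}")
```

Averaging each score over a query set gives one point per ranker in the (behavioral NDCG, textual NDCG) plane; an outward shift of the Pareto frontier means the new ranker improves on both axes at once.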

