IR, LG · Sep 3, 2025

LLM-based Relevance Assessment for Web-Scale Search Evaluation at Pinterest

Han Wang, Alex Whitworth, Pak Ming Cheung, Zhenjie Zhang, Krishna Kamath

arXiv:2509.03764v2 · 2 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses the scalability and cost limits of human annotation for search evaluation at Pinterest. It is an incremental contribution that applies existing LLM methods to a specific domain.

The paper tackles the problem of automating relevance evaluation in Pinterest's search system by using fine-tuned LLMs, resulting in reliable alignment with human annotations, improved efficiency, higher-quality metrics, and a significant reduction in Minimum Detectable Effect.

Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user's queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving evaluation efficiency. Leveraging LLM-based labeling further unlocks opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
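To make the link between a larger labeled query set and a smaller MDE concrete, here is a minimal sketch using the textbook normal-approximation formula for a two-sided, two-sample difference in means. The abstract does not state how the paper computes MDE or designs its samples, so the function name `minimum_detectable_effect`, its parameters, and the numbers below are illustrative assumptions, not the authors' method.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(std_dev, n_per_arm, alpha=0.05, power=0.8):
    """Textbook MDE for a two-sided, two-sample test on a difference in means.

    Assumes equal variance and equal sample size in control and treatment.
    This is a generic formula, not the paper's specific procedure.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    return (z_alpha + z_power) * std_dev * sqrt(2.0 / n_per_arm)


# Hypothetical numbers: the same metric variance, with 4x more labeled queries.
small = minimum_detectable_effect(std_dev=1.0, n_per_arm=1_000)
large = minimum_detectable_effect(std_dev=1.0, n_per_arm=4_000)
print(f"MDE with 1,000 queries per arm: {small:.4f}")
print(f"MDE with 4,000 queries per arm: {large:.4f}")
```

Under this formula the MDE scales with 1/sqrt(n), so quadrupling the number of labeled queries per arm halves the MDE. That is one plausible reason why cheaper LLM labeling, which makes larger query sets affordable, can tighten experiment measurements.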
