IR, AI, LG | Oct 22, 2024

AmazonQAC: A Large-Scale, Naturalistic Query Autocomplete Dataset

Dante Everaert, Rohit Patki, Tianqi Zheng, Christopher Potts

arXiv:2411.04129v120.623 citations | h-index: 2 | Has Code | EMNLP

Originality: Synthesis-oriented

AI Analysis

This paper provides a valuable dataset for researchers and practitioners in search engine development, though the contribution is incremental in that it focuses on data creation rather than on a new method.

The paper tackles the lack of large-scale, realistic datasets for Query Autocomplete (QAC) by introducing AmazonQAC, a dataset with 395M samples from Amazon Search logs, and finds that finetuned LLMs perform best but achieve only half of the theoretical maximum performance.

Query Autocomplete (QAC) is a critical feature in modern search engines, facilitating user interaction by predicting search queries based on input prefixes. Despite its widespread adoption, the absence of large-scale, realistic datasets has hindered advancements in QAC system development. This paper addresses this gap by introducing AmazonQAC, a new QAC dataset sourced from Amazon Search logs, comprising 395M samples. The dataset includes actual sequences of user-typed prefixes leading to final search terms, as well as session IDs and timestamps that support modeling the context-dependent aspects of QAC. We assess Prefix Trees, semantic retrieval, and Large Language Models (LLMs) with and without finetuning. We find that finetuned LLMs perform best, particularly when incorporating contextual information. However, even our best system achieves only half of what we calculate is theoretically possible on our test data, which implies QAC is a challenging problem that is far from solved with existing systems. This contribution aims to stimulate further research on QAC systems to better serve user needs in diverse environments. We open-source this data on Hugging Face at https://huggingface.co/datasets/amazon/AmazonQAC.
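To make the prefix-based baseline idea concrete, here is a minimal sketch of frequency-ranked prefix completion. It is not the paper's implementation: it uses a plain dictionary keyed by prefix as a stand-in for a prefix tree, the names `build_prefix_index` and `complete` are hypothetical, and the toy query log is made up rather than drawn from AmazonQAC.

```python
from collections import Counter, defaultdict


def build_prefix_index(queries):
    """Map every prefix of every logged query to a Counter of full queries.

    A dict keyed by prefix stands in for a prefix tree here (an assumption
    made for brevity); it trades memory for a very simple lookup.
    """
    index = defaultdict(Counter)
    for query in queries:
        for end in range(1, len(query) + 1):
            index[query[:end]][query] += 1
    return index


def complete(index, prefix, k=3):
    """Return up to k past queries starting with prefix, most frequent first."""
    if prefix not in index:
        # Unseen prefix: a pure lookup baseline has nothing to suggest.
        return []
    return [query for query, _ in index[prefix].most_common(k)]


# Toy, made-up query log (hypothetical data, not from AmazonQAC).
toy_log = ["iphone case", "iphone case", "iphone charger", "ipad stand"]
index = build_prefix_index(toy_log)
suggestions = complete(index, "ip")
```

A lookup of this kind can only return queries it has already seen, and it ignores session context entirely, which is the kind of information the abstract says helps the finetuned LLMs.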

