Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops
This work addresses scalable information extraction for e-commerce and data analysis. Its contribution is incremental: it applies existing LLM methods to a specific domain.
The study compares two LLM-based strategies for extracting structured food product information from online shop web pages. It finds that an indirect extraction approach reduces the number of LLM calls by 95.82% while accuracy drops slightly to 96.48%, which improves efficiency and lowers cost.
Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48\%, $-1.61\%$ compared to direct extraction), it reduces the number of required LLM calls by 95.82\%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect extraction approaches can provide scalable and cost-effective solutions for large-scale information extraction tasks from template-based web pages using LLMs.
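To make the difference between the two strategies concrete, the following minimal sketch contrasts their LLM call counts on a toy set of template-based pages. It is an illustration under stated assumptions, not the pipeline evaluated in this work: `CountingLLM`, `generate_extractor`, the `ingredients` markup, and the choice to generate one function per shop are all hypothetical, and the model is replaced by a stub that only counts calls.

```python
import re
from typing import Callable, Dict, List

Page = Dict[str, str]          # hypothetical page record: {"shop": ..., "html": ...}
Record = Dict[str, str]        # hypothetical schema with a single "ingredients" field


def _parse_ingredients(html: str) -> Record:
    """Toy extractor standing in for what a model would return or generate."""
    match = re.search(r'<div class="ingredients">(.*?)</div>', html, re.S)
    return {"ingredients": match.group(1).strip() if match else ""}


class CountingLLM:
    """Hypothetical stand-in for an LLM client; it counts calls instead of querying a model."""

    def __init__(self) -> None:
        self.calls = 0

    def extract(self, html: str) -> Record:
        # Direct extraction: one model call per page.
        self.calls += 1
        return _parse_ingredients(html)

    def generate_extractor(self, sample_html: str) -> Callable[[str], Record]:
        # Indirect extraction: one model call yields a reusable function.
        self.calls += 1
        return _parse_ingredients


def direct_extraction(pages: List[Page], llm: CountingLLM) -> List[Record]:
    return [llm.extract(page["html"]) for page in pages]


def indirect_extraction(pages: List[Page], llm: CountingLLM) -> List[Record]:
    extractors: Dict[str, Callable[[str], Record]] = {}
    results: List[Record] = []
    for page in pages:
        shop = page["shop"]
        if shop not in extractors:
            # Assumption: pages of one shop share a template, so one function suffices.
            extractors[shop] = llm.generate_extractor(page["html"])
        results.append(extractors[shop](page["html"]))
    return results


def demo() -> tuple:
    pages = [
        {"shop": f"shop{i % 3}", "html": f'<div class="ingredients">item {i}</div>'}
        for i in range(30)
    ]
    direct_llm, indirect_llm = CountingLLM(), CountingLLM()
    direct = direct_extraction(pages, direct_llm)
    indirect = indirect_extraction(pages, indirect_llm)
    return direct_llm.calls, indirect_llm.calls, direct == indirect
```

In this toy setting the direct strategy issues one call per page, whereas the indirect strategy issues one call per shop template. The call counts here are properties of the sketch only and do not reproduce the reported 95.82\% reduction. A real pipeline would also need to handle generated functions that fail on some pages.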