CL · AI · Apr 19, 2024

AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

arXiv:2404.12753v2 · 28 citations · h-index: 22 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of scalable and reusable web scraping for data collection and analysis, representing an incremental improvement over existing methods.

The paper addresses the problem of generating web scrapers that are adaptable and reusable across diverse websites. It introduces AutoScraper, a two-stage framework that leverages HTML structure and page similarity, and reports improved performance in experiments with multiple LLMs, as measured by a new executability metric.

Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. We also propose a new executability metric to better measure performance on web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources of this paper can be found at https://github.com/EZ-hwh/AutoScraper
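
To make the idea of a reusable scraper concrete, here is a minimal sketch, not the authors' implementation. It assumes a scraper can be reduced to a single XPath rule, and that pages from the same site share a template. The pages, the rule, and the `reuse_rate` helper are all hypothetical. In particular, `reuse_rate` is a toy stand-in and not the paper's executability metric.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pages that share one template
# (same site, different items). Real HTML would need an HTML parser.
PAGES = [
    "<html><body><div class='item'><h1>Lamp</h1><span class='price'>19.99</span></div></body></html>",
    "<html><body><div class='item'><h1>Desk</h1><span class='price'>149.00</span></div></body></html>",
    "<html><body><div class='item'><h1>Chair</h1></div></body></html>",  # price missing
]
# Hypothetical ground truth: None means the field is absent on that page.
EXPECTED = ["19.99", "149.00", None]


def run_scraper(xpath, page):
    """Apply one XPath rule to a page; return the first match's text, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


def reuse_rate(xpath, pages, expected):
    """Fraction of pages on which the rule returns the expected value.

    Toy stand-in for illustration only, not the paper's metric.
    """
    hits = sum(run_scraper(xpath, p) == e for p, e in zip(pages, expected))
    return hits / len(pages)


# One rule, generated once, then reused across every page of the template.
PRICE_RULE = ".//span[@class='price']"
results = [run_scraper(PRICE_RULE, p) for p in PAGES]
rate = reuse_rate(PRICE_RULE, PAGES, EXPECTED)
```

The point of the sketch is the separation of costs: the rule is produced once and then applied to many pages without further model calls, which is the reusability property the abstract contrasts with per-page language agents.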
