IRAICL, Dec 29, 2024

The Synergy of Automated Pipelines with Prompt Engineering and Generative AI in Web Crawling

arXiv:2502.15691v1
Originality: Synthesis-oriented
AI Analysis

This work addresses web scraping challenges for data extraction practitioners, but it is incremental as it applies existing AI methods to a specific domain.

This study tackled the challenge of automating web crawling by integrating generative AI tools (Claude AI and ChatGPT-4.0) with prompt engineering, showing that Claude AI consistently outperformed ChatGPT-4.0 in script quality and adaptability based on predefined metrics.

Web crawling is a critical technique for extracting online data, yet it is complicated by the diversity of webpage structures and by anti-scraping mechanisms. This study investigates the integration of two generative AI tools, Claude AI (Sonnet 3.5) and ChatGPT-4.0, with prompt engineering to automate web scraping. Using two prompts, PROMPT I (general inference, tested on Yahoo News) and PROMPT II (element-specific, tested on Coupons.com), we evaluate the code quality and performance of the AI-generated scripts. Claude AI consistently outperformed ChatGPT-4.0 in script quality and adaptability, as measured by predefined evaluation metrics: functionality, readability, modularity, and robustness. Performance data were collected through manual testing and structured scoring by three evaluators. Visualizations further illustrate Claude AI's superiority. Anti-scraping solutions, including undetected_chromedriver, Selenium, and fake_useragent, were incorporated to enhance performance. This paper demonstrates how generative AI combined with prompt engineering can simplify and improve web scraping workflows.
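The abstract describes structured scoring by three evaluators across four metrics but does not give the scoring scale or the aggregation rule. The sketch below shows one plausible way such scores could be combined, assuming each evaluator assigns one numeric score per metric and that scores are averaged with equal weight. The function name `aggregate_scores`, the equal weighting, and the placeholder numbers in the example are all assumptions for illustration, not values or methods taken from the paper.

```python
from statistics import mean

# The four metrics named in the abstract.
METRICS = ("functionality", "readability", "modularity", "robustness")


def aggregate_scores(scores_by_evaluator):
    """Average each metric across evaluators, then average the metrics.

    scores_by_evaluator: list of dicts, one per evaluator, each mapping
    every name in METRICS to a numeric score.
    Equal weighting is an assumption; the paper's rule is not stated here.
    """
    per_metric = {
        metric: mean(evaluator[metric] for evaluator in scores_by_evaluator)
        for metric in METRICS
    }
    overall = mean(per_metric.values())
    return per_metric, overall


if __name__ == "__main__":
    # Placeholder scores for one hypothetical generated script.
    # These numbers are invented for illustration only.
    example = [
        {"functionality": 4, "readability": 5, "modularity": 3, "robustness": 4},
        {"functionality": 5, "readability": 4, "modularity": 3, "robustness": 2},
        {"functionality": 3, "readability": 3, "modularity": 3, "robustness": 3},
    ]
    per_metric, overall = aggregate_scores(example)
    for metric, value in per_metric.items():
        print(f"{metric}: {value}")
    print(f"overall: {overall}")
```

Running the same aggregation once per tool and per prompt would yield directly comparable per-metric and overall figures, which is the kind of comparison the abstract reports.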
