Victor Herrero-Solana

2 Papers

8.2 DL Apr 17
News Harvesting from Google News combining Web Scraping, LLM Metadata Extraction and SCImago Media Rankings enrichment: a case study of IFMIF-DONES

Victor Herrero-Solana

This study develops and evaluates a systematic methodology for constructing news datasets from Google News, combining automated web scraping, large language model (LLM)-based metadata extraction, and SCImago Media Rankings enrichment. Using the IFMIF-DONES fusion energy project as a case study, we implemented a five-stage data collection pipeline across 81 region-language combinations, yielding 1,482 validated records after a 56% noise reduction. Results are compared against two licensed press databases: MyNews (2,280 records) and ProQuest Newsstream Collection (148 records). Overlap analysis reveals high complementarity, with 76% of Google News records exclusive to this platform. The dataset captures content types absent from proprietary databases, including specialized outlets, institutional communications, and social media posts. However, significant methodological challenges emerge: temporal instability requiring synchronic collection, a 100-result cap per query demanding multi-stage strategies, and unexpected noise including academic PDFs, false positives, and pornographic content infiltrating results through black hat SEO techniques. LLM-assisted extraction proved effective for structured articles but exhibited systematic hallucination patterns requiring validation protocols. We conclude that Google News offers valuable complementary coverage for communication research but demands substantial methodological investment, multi-source triangulation, and robust filtering mechanisms to ensure dataset integrity.
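The overlap analysis mentioned above can be illustrated with a minimal sketch. The abstract does not say how records were matched across databases, so matching on a normalized URL is an assumption made here for illustration, and the names `normalize_url` and `exclusive_share` are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a comparable key (assumed matching rule, not the study's).

    Drops the scheme, a leading 'www.', the query string, and any trailing slash.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def exclusive_share(source: list[str], *others: list[str]) -> float:
    """Fraction of distinct records in `source` that appear in none of `others`."""
    source_keys = {normalize_url(u) for u in source}
    other_keys: set[str] = set()
    for collection in others:
        other_keys.update(normalize_url(u) for u in collection)
    if not source_keys:
        return 0.0
    return len(source_keys - other_keys) / len(source_keys)


# Toy data only; these URLs are placeholders, not records from the study.
google_news = [
    "https://www.example.org/a/",
    "http://example.org/b",
    "https://news.example.com/c",
    "https://example.net/d",
]
licensed_db = ["https://example.org/a"]

share = exclusive_share(google_news, licensed_db)
```

In practice, licensed press databases often lack the original URL, so a real comparison would likely need fuzzy matching on title, outlet, and date instead.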

DL Feb 16
Artificial Intelligence Specialization in the European Union: Underexplored Role of the Periphery at NUTS-3 Level

Victor Herrero-Solana

This study examines the geographical distribution of Artificial Intelligence (AI) research production across European regions at the NUTS-3 level for the period 2015-2024. Using bibliometric data from Clarivate InCites and the Citation Topics classification system, we analyze two hierarchical levels of thematic aggregation: Electrical Engineering, Electronics & Computer Science (Macro Citation Topic 4) and Artificial Intelligence & Machine Learning (Meso Citation Topic 4.61). We calculate the Relative Specialization Index (RSI) and Relative Citation Impact (RCI) for 781 NUTS-3 regions. While major metropolitan hubs such as Paris (Île-de-France), Warszawa, and Madrid lead in absolute production volume, our findings reveal that peripheral regions, particularly in Eastern Europe and Spain, exhibit the highest levels of relative AI specialization. Notably, we find virtually no correlation between regional specialization and citation impact, and we identify four distinct regional profiles: high-impact specialized regions (e.g., Granada, Jaén, Vilniaus); high-volume but low-impact regions (e.g., Burgas, several Polish regions); high-impact non-specialized regions, among which Fyn (Denmark) stands out as a remarkable outlier with exceptional citation impact (RCI > 4) despite low specialization; and diversified portfolios with selective excellence (e.g., German regions). These results suggest that AI research represents a strategic opportunity for peripheral regions to develop competitive scientific niches, though achieving international visibility requires more than research volume alone.
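The abstract does not state the formulas behind the two indicators. As a hedged illustration, the sketch below assumes one common bibliometric form: the RSI as a symmetric transform of the activity index (a region's share of output in the topic divided by the reference set's share), and the RCI as the region's citations per paper divided by the reference set's citations per paper. Function names and input figures are hypothetical, and InCites may normalize these indicators differently.

```python
def activity_index(region_topic: int, region_total: int,
                   ref_topic: int, ref_total: int) -> float:
    """Region's share of output in the topic relative to the reference share."""
    return (region_topic / region_total) / (ref_topic / ref_total)


def relative_specialization_index(region_topic: int, region_total: int,
                                  ref_topic: int, ref_total: int) -> float:
    """Symmetric RSI in (-1, 1): 0 means the region matches the reference share.

    Assumed form: (AI - 1) / (AI + 1); the paper may use another variant.
    """
    ai = activity_index(region_topic, region_total, ref_topic, ref_total)
    return (ai - 1) / (ai + 1)


def relative_citation_impact(region_cites: int, region_papers: int,
                             ref_cites: int, ref_papers: int) -> float:
    """Citations per paper in the region relative to the reference set.

    Values above 1 indicate above-baseline impact (assumed definition).
    """
    return (region_cites / region_papers) / (ref_cites / ref_papers)


# Made-up numbers for a hypothetical region; not data from the study.
rsi = relative_specialization_index(region_topic=50, region_total=500,
                                    ref_topic=1_000, ref_total=50_000)
rci = relative_citation_impact(region_cites=2_000, region_papers=500,
                               ref_cites=100_000, ref_papers=50_000)
```

Under these assumptions a small region can score a high RSI with few papers, which is consistent with the abstract's point that specialization and citation impact need not move together.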