IR | AI | Oct 24, 2024

Smart ETL and LLM-based contents classification: the European Smart Tourism Tools Observatory experience

arXiv:2410.18641v1 | h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses content management and search efficiency for users of the European Smart Tourism Tools Observatory, but it is incremental as it applies existing AI methods to a specific domain.

The research tackles the problem of updating and categorizing European Smart Tourism Tools (STTs) in an online observatory. It uses a Smart ETL process that combines PDF scraping with LLM-based classification, and its preliminary results demonstrate the potential of LLMs for text content-based classification.

Purpose: Our research project focuses on improving how content is updated in the online European Smart Tourism Tools (STTs) Observatory by incorporating and categorizing STTs. The categorization follows the STT taxonomy and makes searching easier for the end user. Central to this effort is a Smart ETL (Extract, Transform, and Load) process, where "Smart" indicates the use of Artificial Intelligence (AI).

Methods: The contents describing STTs are derived from PDF catalogs, from which PDF-scraping techniques extract QR codes, images, links, and text information. STTs duplicated across catalogs are removed, and the remaining ones are classified from their text information using Large Language Models (LLMs). Finally, the data is transformed to comply with the Dublin Core metadata structure (the observatory's metadata structure), chosen for its wide acceptance and flexibility.

Results: The Smart ETL process that imports STTs into the observatory combines PDF-scraping techniques with LLMs for text content-based classification. Our preliminary results demonstrate the potential of LLMs for text content-based classification.

Conclusion: The feasibility of the proposed approach is a step towards efficient content-based classification, not only in Smart Tourism but also in other fields to which it can be adapted. Future work will mainly focus on refining this classification process.
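
The transform steps described in the Methods (duplicate removal, text-based classification, and mapping to Dublin Core) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields (`name`, `text`, `catalog`, `link`), the taxonomy labels, and the choice of Dublin Core elements are assumptions made for illustration, and `keyword_classifier` is a simple stand-in for the LLM call, which the abstract does not specify.

```python
import re
from typing import Callable

# Hypothetical labels; the actual STT taxonomy is not listed in the abstract.
TAXONOMY = ["navigation", "booking", "accessibility"]


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical names match."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each normalized tool name."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        key = normalize(rec["name"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def keyword_classifier(text: str) -> str:
    """Stand-in for an LLM: return the first taxonomy label found in the text."""
    lowered = text.lower()
    for label in TAXONOMY:
        if label in lowered:
            return label
    return "other"


def to_dublin_core(rec: dict, category: str) -> dict:
    """Map a scraped record onto a few Dublin Core elements (assumed mapping)."""
    return {
        "dc:title": rec["name"],
        "dc:description": rec["text"],
        "dc:subject": category,
        "dc:source": rec.get("catalog", ""),
        "dc:identifier": rec.get("link", ""),
    }


def run_etl(records: list[dict],
            classify: Callable[[str], str] = keyword_classifier) -> list[dict]:
    """Deduplicate, classify each remaining record, and emit Dublin Core dicts."""
    return [to_dublin_core(r, classify(r["text"])) for r in deduplicate(records)]
```

Passing the classifier in as a function keeps the pipeline independent of any particular model: a function that sends the text and the taxonomy to an LLM and returns one label could replace `keyword_classifier` without changing the other steps.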

Code Implementations: 1 repo