IR, Jul 12, 2021

Inscriptis -- A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

arXiv:2108.01454v2, 13 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for lightweight, accurate HTML-to-text conversion tools for researchers and developers in web-based knowledge extraction, offering an incremental improvement over existing solutions.

The authors address HTML-to-plain-text conversion for knowledge extraction with Inscriptis, a Python library that performs layout-aware conversion and supports annotation rules. The resulting text representations preserve the spatial alignment of the rendered page and carry structural and semantic information from the HTML markup.

Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium. In contrast to related software packages, Inscriptis (i) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers; and (ii) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document.
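The idea of annotation rules described above can be illustrated with a minimal sketch built only on Python's standard library. This is not Inscriptis's implementation or API: the class `AnnotatingTextExtractor`, the function `html_to_annotated_text`, the shape of the returned dictionary, and the very simplified layout model (block tags start on a new line, whitespace is collapsed) are all assumptions made for illustration. The sketch only shows how a user-provided mapping from HTML tags to labels might yield character spans over the extracted text.

```python
import re
from html.parser import HTMLParser


class AnnotatingTextExtractor(HTMLParser):
    """Toy HTML-to-text converter that records labeled character spans.

    Hypothetical illustration only; not how Inscriptis works internally.
    """

    # Tags that begin and end on their own line in this simplified model.
    BLOCK_TAGS = {"h1", "h2", "h3", "p", "div", "li"}

    def __init__(self, annotation_rules):
        super().__init__()
        self.annotation_rules = annotation_rules  # e.g. {"h1": ["heading"]}
        self.parts = []       # text fragments emitted so far
        self.length = 0       # total number of characters emitted
        self.open_spans = []  # stack of (tag, start_offset) for annotated tags
        self.labels = []      # finished (start, end, label) triples

    def _at_line_start(self):
        return self.length == 0 or self.parts[-1].endswith("\n")

    def _emit(self, text):
        self.parts.append(text)
        self.length += len(text)

    def handle_starttag(self, tag, attrs):
        # Block elements start on a fresh line.
        if tag in self.BLOCK_TAGS and not self._at_line_start():
            self._emit("\n")
        # Remember where the text of an annotated element begins.
        if tag in self.annotation_rules:
            self.open_spans.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the innermost open span if it belongs to this tag.
        if self.open_spans and self.open_spans[-1][0] == tag:
            _, start = self.open_spans.pop()
            for label in self.annotation_rules[tag]:
                self.labels.append((start, self.length, label))
        # Block elements end with a line break.
        if tag in self.BLOCK_TAGS and not self._at_line_start():
            self._emit("\n")

    def handle_data(self, data):
        # Collapse whitespace runs, as a browser would for normal flow text.
        text = re.sub(r"\s+", " ", data)
        if self._at_line_start():
            text = text.lstrip()
        if text:
            self._emit(text)


def html_to_annotated_text(html, annotation_rules):
    """Return the extracted text and the (start, end, label) annotations."""
    parser = AnnotatingTextExtractor(annotation_rules)
    parser.feed(html)
    parser.close()
    return {"text": "".join(parser.parts), "label": parser.labels}


if __name__ == "__main__":
    rules = {"h1": ["heading"], "b": ["emphasis"]}
    html = "<h1>Chur</h1><p>Chur is the capital of <b>Grisons</b>.</p>"
    result = html_to_annotated_text(html, rules)
    for start, end, label in result["label"]:
        print(label, repr(result["text"][start:end]))
```

Because each annotation is stored as offsets into the output text, a downstream extraction component could slice the text by label (for example, to treat headings differently from body text) without re-parsing the HTML.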

Code Implementations: 1 repo