FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents
This work addresses the need for scalable and transferable information extraction on web documents, benefiting applications such as knowledge base augmentation and domain-specific experiences, and it improves on the previous state of the art.
The paper tackles the problem of extracting structured data from HTML documents by introducing FreeDOM, a two-stage neural approach that generalizes to unseen sites after training on a small number of seed sites from the same vertical, without requiring visual renderings or hand-crafted features. On a public dataset covering 8 verticals, it improves on the previous state of the art by nearly 3.7 F1 points on average.
Extracting structured data from HTML documents is a long-studied problem with a broad range of applications, such as augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully hand-crafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both of these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer-range distance and semantic relatedness between nodes using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical, without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without relying on rendered pages or expensive hand-crafted features.
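To make the two-stage data flow concrete, the following is a minimal sketch of the kind of input each stage could consume. It is not the FreeDOM model: no neural network is involved, and every name here (`extract_nodes`, `node_features`, `pair_features`, the `gap` and `shared_depth` fields) is a hypothetical choice made for illustration. The first part gathers text and markup information per DOM text node, and the second part builds pairwise signals about distance and structural relatedness that a relational model might take as input.

```python
from html.parser import HTMLParser
from itertools import combinations

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collects every non-empty text node together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.nodes = []  # list of (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop until the matching open tag has been removed.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/" + "/".join(self.stack), text))


def extract_nodes(html):
    """Return (tag_path, text) pairs for the text nodes of an HTML string."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def node_features(path, text):
    """Stage-one stand-in: per-node text and markup information (assumed)."""
    return {"path": path, "tokens": text.lower().split()}


def pair_features(nodes):
    """Stage-two stand-in: pairwise distance and structural relatedness (assumed)."""
    pairs = []
    for (i, (path_i, _)), (j, (path_j, _)) in combinations(enumerate(nodes), 2):
        shared = 0
        for a, b in zip(path_i.split("/"), path_j.split("/")):
            if a != b:
                break
            shared += 1
        pairs.append({
            "i": i,
            "j": j,
            "gap": j - i,                 # distance in document order
            "shared_depth": shared - 1,   # common ancestors; -1 drops the leading ""
        })
    return pairs


PAGE = (
    "<html><body><h1>Acme Phone</h1>"
    "<div><span>Price:</span><span>$199</span></div></body></html>"
)
nodes = extract_nodes(PAGE)
features = [node_features(path, text) for path, text in nodes]
relations = pair_features(nodes)
```

In this toy page, the label node "Price:" and the value node "$199" share a deeper common ancestor path than either shares with the title, which is the kind of relational signal the second stage is described as capturing. In the actual system, learned representations would take the place of these hand-written fields.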