CL, IR | Oct 21, 2020

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

arXiv:2010.10755v1 | 50 citations
Originality: Highly original
AI Analysis

This paper addresses the need for scalable, transferable information extraction from web documents, which benefits applications such as knowledge base augmentation and domain-specific experiences. Its gains over prior methods are incremental.

The paper tackles the problem of extracting structured data from HTML documents by introducing FreeDOM, a two-stage neural approach. After training on a small number of seed sites in a vertical, it generalizes to unseen sites in that vertical without requiring visual renderings or handcrafted features. On a public dataset covering 8 verticals, it improves on the previous state of the art by nearly 3.7 F1 points on average.

Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without requiring features over rendered pages or expensive hand-crafted features.
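The two inputs the abstract names, per-node text plus markup and relations between pairs of nodes, can be illustrated with a small, non-neural sketch. This is not the FreeDOM model: the names `TextNodeCollector`, `node_features`, and `pair_features`, and the two pair features (`position_gap`, `shared_depth`), are hypothetical choices made here only to show what kind of information each stage could consume. It uses only the Python standard library.

```python
from html.parser import HTMLParser
from itertools import combinations

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextNodeCollector(HTMLParser):
    """Collects each non-empty text node together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open tags from the root down to the current node
        self.nodes = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def node_features(html):
    """Stage-1 stand-in (assumption): pair each text node with its markup path."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def pair_features(nodes):
    """Stage-2 stand-in (assumption): simple relational features per node pair."""
    feats = []
    for (i, (path_a, _)), (j, (path_b, _)) in combinations(enumerate(nodes), 2):
        shared = 0
        for tag_a, tag_b in zip(path_a, path_b):
            if tag_a != tag_b:
                break
            shared += 1
        feats.append({"pair": (i, j), "position_gap": j - i, "shared_depth": shared})
    return feats


if __name__ == "__main__":
    page = (
        "<html><body><h1>Inception</h1>"
        "<div><span>Director</span><span>Christopher Nolan</span></div>"
        "</body></html>"
    )
    nodes = node_features(page)
    for path, text in nodes:
        print("/".join(path), "->", text)
    for feat in pair_features(nodes):
        print(feat)
```

In a learned system, the hand-written pair features above would be replaced by trained representations. The sketch only makes concrete the point that neighbouring nodes such as a "Director" label and the adjacent name share most of their markup path, which is the kind of pairwise signal a relational second stage could exploit.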

