IR · LG · ML · Jul 16, 2018

Pangloss: Fast Entity Linking in Noisy Text Environments

arXiv:1807.06036v1 · 5 citations
Originality: Incremental advance
AI Analysis

The paper addresses entity disambiguation on noisy real-world text such as social media, in support of applications like semantic search and knowledge graph construction. Its approach is an incremental advance.

The paper tackles entity linking in noisy text environments and presents Pangloss, a production system that reports an F1 improvement of more than 5% over other research and commercially available systems.

Entity linking is the task of mapping potentially ambiguous terms in text to their constituent entities in a knowledge base like Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications like semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on text that has been well-formed, like news articles, but in common real world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely-structured text adds a new dimension to this problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.
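To make the disambiguation step concrete, here is a minimal sketch of similarity-based candidate ranking: each candidate entity for an ambiguous mention is scored by the cosine similarity between its vector and a vector for the surrounding text. This is an illustration only, not the Pangloss implementation. The function names, the entity identifiers, and the hand-made three-dimensional vectors are all assumptions, and a real system would use learned context-dependent document embeddings rather than toy numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context_vec, candidates):
    """Return the candidate entity id whose vector is most similar to the context.

    `candidates` maps a (hypothetical) entity id to its embedding vector.
    """
    return max(candidates, key=lambda entity: cosine(context_vec, candidates[entity]))


# Hypothetical toy embeddings for two senses of the mention "python".
CANDIDATES = {
    "Python_(programming_language)": [0.9, 0.1, 0.0],
    "Python_(snake)": [0.0, 0.2, 0.9],
}

# Assumed embedding of a noisy context such as "need python dev w/ django exp".
context = [0.8, 0.3, 0.1]

best = disambiguate(context, CANDIDATES)
print(best)
```

With these toy vectors the context sits much closer to the programming-language sense than to the snake sense, so that candidate is selected. In practice the candidate set would come from a key phrase identification step and a knowledge base lookup, both of which are omitted here.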

