CL AI
Feb 22, 2016

Empath: Understanding Topic Signals in Large-Scale Text

Ethan Fast, Binbin Chen, Michael Bernstein

arXiv:1602.06979v1
21.7434 citations
Has Code

Originality: Incremental advance

AI Analysis

Empath gives researchers and practitioners broader topic coverage for text analysis. The advance is incremental because it builds on existing methods such as neural embeddings and crowd validation.

The authors address the limited topic coverage of existing text analysis tools with Empath, which generates and validates new lexical categories from a small set of seed terms using a neural embedding trained on more than 1.8 billion words of modern fiction. Empath's categories correlate strongly (r=0.906) with similar categories in LIWC.
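As a rough illustration of what a correlation such as r=0.906 measures, the sketch below computes a Pearson correlation between two lists of category scores. The score values and the names `empath_scores` and `liwc_scores` are hypothetical, invented for this example; they are not data from the paper, and the paper's exact evaluation procedure may differ.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-document scores for one category from two tools (invented values).
empath_scores = [0.10, 0.20, 0.30, 0.40]
liwc_scores = [0.12, 0.18, 0.33, 0.41]

r = pearson_r(empath_scores, liwc_scores)
print(f"r = {r:.3f}")
```

A value near 1 means the two tools rank documents almost identically for that category.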

Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
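The seed-expansion idea in the abstract can be sketched as a nearest-neighbor search in an embedding space. This is a minimal illustration under stated assumptions, not Empath's implementation: the three-dimensional vectors in `TOY_EMBEDDING` are invented, the function `expand_category` is a hypothetical helper, and the use of a seed centroid with cosine similarity is an assumption about one reasonable way to rank candidates. The crowd-powered validation step is not modeled.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Empath's real embedding is learned from a large fiction corpus.
TOY_EMBEDDING = {
    "bleed":  [0.90, 0.10, 0.00],
    "punch":  [0.80, 0.20, 0.10],
    "stab":   [0.85, 0.15, 0.05],
    "wound":  [0.90, 0.20, 0.00],
    "garden": [0.00, 0.90, 0.30],
    "flower": [0.10, 0.80, 0.40],
    "tweet":  [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand_category(seeds, embedding, top_k=2):
    """Return the top_k non-seed words closest to the centroid of the seed vectors."""
    dim = len(embedding[seeds[0]])
    centroid = [sum(embedding[s][i] for s in seeds) / len(seeds) for i in range(dim)]
    candidates = [w for w in embedding if w not in seeds]
    ranked = sorted(candidates, key=lambda w: cosine(embedding[w], centroid), reverse=True)
    return ranked[:top_k]

# Seed terms taken from the abstract's "violence" example.
violence_terms = expand_category(["bleed", "punch"], TOY_EMBEDDING)
print(violence_terms)
```

In the tool described above, candidate terms like these would then pass through a human filter before being added to the category.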

