CL · CO · Mar 27, 2017

A Tidy Data Model for Natural Language Processing using cleanNLP

arXiv:1703.09570v2 · 32 citations
Originality: Synthesis-oriented
AI Analysis

This work offers a domain-specific tool for NLP researchers and practitioners who need efficient data preprocessing. Its contribution is incremental, since it builds on the existing CoreNLP library.

The paper addresses the problem of converting textual corpora into normalized tables for NLP. It introduces the cleanNLP package, which uses Stanford's CoreNLP library to provide fast annotation tools for text in multiple languages.

The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford's CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
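
To make the idea of "normalized tables" concrete, the following is a minimal Python sketch of a tidy layout: one table with a row per document and one with a row per token, linked by a document key. It is not cleanNLP's API and does not call CoreNLP. The function name `tidy_tokens`, the column names `doc_id`, `sid`, and `tid`, and the regex-based sentence splitter and tokenizer are all assumptions made for illustration.

```python
import re


def tidy_tokens(corpus):
    """Return (documents, tokens) as two lists of row dicts.

    corpus: dict mapping a document id to its raw text.
    The column names used here are hypothetical, chosen for illustration.
    """
    documents = []
    tokens = []
    for doc_id, text in corpus.items():
        # One row per document.
        documents.append({"doc_id": doc_id, "n_chars": len(text)})
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for sid, sentence in enumerate(sentences, start=1):
            # Naive tokenizer: runs of word characters, or single punctuation marks.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            for tid, word in enumerate(words, start=1):
                # One row per token, keyed by document, sentence, and position.
                tokens.append(
                    {"doc_id": doc_id, "sid": sid, "tid": tid, "token": word}
                )
    return documents, tokens


corpus = {"d1": "Tidy data is simple. It works!", "d2": "Hello world."}
documents, tokens = tidy_tokens(corpus)
for row in tokens:
    print(row["doc_id"], row["sid"], row["tid"], row["token"])
```

Because every token row carries the same keys, further annotations (for example part-of-speech tags or entity labels) could be added as extra columns or as separate tables joined on `doc_id`, `sid`, and `tid`.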

Code Implementations: 1 repo
