IR | Oct 5, 2018

Sifaka: Text Mining Above a Search API

arXiv:1810.02907v1 | 1.7 | Has Code

Originality Synthesis-oriented

AI Analysis

Sifaka makes text mining more accessible to researchers with limited programming skills, though the contribution is incremental because it builds on existing search APIs.

The authors tackled the high cost and complexity of text mining systems by building Sifaka, an open-source application on top of a standard search engine index, which enables quick exploration and analysis of large text collections with integrated annotation and machine learning compatibility.

Text mining and analytics software has become popular, but little attention has been paid to the software architectures of such systems. Often they are built from scratch using special-purpose software and data structures, which increases their cost and complexity. This demo paper describes Sifaka, a new open-source text mining application constructed above a standard search engine index using existing application programmer interface (API) calls. Indexing integrates popular annotation software libraries to augment the full-text index with noun phrases and named entities; n-grams are also provided. Sifaka enables a person to quickly explore and analyze large text collections using search, frequency analysis, and co-occurrence analysis. It also lets the person import existing document labels or interactively construct document sets that are positive or negative examples of new concepts, perform feature selection, and export feature vectors compatible with popular machine learning software. Sifaka demonstrates that search engines are good platforms for text mining applications while also making common IR text mining capabilities accessible to researchers in disciplines where programming skills are less common.
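
To make the idea of frequency and co-occurrence analysis over a search index concrete, here is a minimal Python sketch. It is not Sifaka's code and does not use its API: the toy inverted index, the function names (`build_index`, `doc_frequency`, `cooccurrence`, `top_cooccurring`), and the sample documents are all assumptions made for illustration. It only shows that both analyses reduce to lookups and intersections on the postings a search index already stores.

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: map each term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def doc_frequency(index, term):
    """Frequency analysis: number of documents that contain the term."""
    return len(index.get(term, ()))

def cooccurrence(index, term_a, term_b):
    """Co-occurrence analysis: documents containing both terms (postings intersection)."""
    return len(index.get(term_a, set()) & index.get(term_b, set()))

def top_cooccurring(index, term, k=3):
    """Rank other terms by how many documents they share with `term`."""
    base = index.get(term, set())
    counts = ((other, len(base & postings))
              for other, postings in index.items() if other != term)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted((tc for tc in counts if tc[1] > 0),
                    key=lambda tc: (-tc[1], tc[0]))
    return ranked[:k]

# Hypothetical sample collection, for illustration only.
DOCS = {
    1: "search engines index text",
    2: "text mining uses search index",
    3: "machine learning uses feature vectors",
}
INDEX = build_index(DOCS)
```

With this sample collection, `doc_frequency(INDEX, "search")` counts the documents mentioning "search", and `top_cooccurring(INDEX, "search")` lists the terms that most often share a document with it. A production system would read the same statistics from the search engine's index through its API rather than rebuilding them in memory.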
