Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ
Scattertext addresses the need of researchers and analysts to compare corpora; its contribution is incremental, since it builds on existing visualization methods.
The paper tackles the problem of visualizing linguistic differences between document categories. It introduces Scattertext, a browser-based tool that displays a scatterplot containing thousands of term points and legibly labels hundreds of them, in a way that does not depend on the language of the documents.
Scattertext is an open-source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot in which each axis corresponds to the rank of a term's frequency in one category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible points, each representing a term, and to find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as to a visualization that compares the importance scores of bag-of-words features to univariate metrics.
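
To make the axis definition concrete, the following is a minimal sketch of how per-category frequency ranks could be turned into scatterplot coordinates. It is not Scattertext's own code: the function name `dense_rank_coordinates`, the whitespace tokenization, the scaling to the range [0, 1], and the toy documents are all assumptions made for illustration, and the tool's tie-breaking strategy is not reproduced here.

```python
from collections import Counter


def dense_rank_coordinates(docs_by_category):
    """Map each term to one dense frequency rank per category, scaled to [0, 1].

    Hypothetical helper for illustration only; not part of Scattertext.
    """
    # Count term occurrences per category (assumption: lowercase + whitespace split).
    counts = {
        category: Counter(tok for doc in docs for tok in doc.lower().split())
        for category, docs in docs_by_category.items()
    }
    vocab = sorted(set().union(*counts.values()))
    coords = {term: {} for term in vocab}
    for category, counter in counts.items():
        # Dense ranking: equal frequencies share a rank and ranks have no gaps.
        distinct = sorted({counter[term] for term in vocab})
        rank_of = {freq: rank for rank, freq in enumerate(distinct)}
        top = max(len(distinct) - 1, 1)  # avoid dividing by zero
        for term in vocab:
            coords[term][category] = rank_of[counter[term]] / top
    return coords


# Toy two-category corpus (invented for this example).
docs = {
    "a": ["jobs jobs jobs taxes", "jobs health"],
    "b": ["taxes taxes freedom", "freedom taxes jobs"],
}
coords = dense_rank_coordinates(docs)
for term, point in coords.items():
    print(term, point)
```

In this sketch, terms with the same frequency in a category receive the same coordinate on that axis and would therefore overlap when plotted. That overlap is the situation the tie-breaking strategy described above is meant to resolve.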