Exploring the Daschle Collection using Text Mining
This work provides a tool for efficiently summarizing large political document collections, but its contribution is incremental: it applies standard NLP methods to a new dataset.
The study applied Latent Dirichlet Allocation (LDA) to scanned documents and emails from Senator Daschle's collection. The model identified major topics that reflect key events and issues from his career, showing that large text collections can be summarized efficiently.
A U.S. Senator from South Dakota donated documents accumulated during his service as a House representative and senator, to be housed at the Bridges library at South Dakota State University. This project investigated the utility of quantitative statistical methods for exploring portions of this vast document collection. The available scanned documents and emails from constituents were analyzed using natural language processing methods, including the Latent Dirichlet Allocation (LDA) model. This model identifies the major topics discussed in a given collection of documents. Important events and popular issues from Senator Daschle's career are reflected in the changing topics from the model. These quantitative statistical methods summarize the massive amount of text without requiring significant human effort or time, and they can be applied to similar collections.
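
To illustrate how an LDA model assigns topics to the words in a collection, the following is a minimal sketch that fits LDA by collapsed Gibbs sampling using only the Python standard library. It is not the implementation used in this project, which the text does not specify. The function names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, and the four-document toy corpus are all assumptions made up for illustration.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Hypothetical helper: alpha and beta are symmetric Dirichlet priors
    chosen for illustration, not values taken from the project.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return up to n most frequent words assigned to each topic."""
    return [
        [w for w, c in counts.most_common(n) if c > 0]
        for counts in topic_word
    ]


# Invented toy corpus; real input would be tokenized text from the collection.
docs = [
    "farm subsidy farm crop price".split(),
    "crop price drought farm".split(),
    "health insurance medicare cost".split(),
    "medicare health cost insurance reform".split(),
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {words}")
```

Fitting a model of this kind separately on documents grouped by time period is one way the changing topics mentioned above could be tracked. For a collection of realistic size, an established library implementation would be a more practical choice than this sketch.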