CL | Nov 5, 2013

Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages

Dániel Kondor, István Csabai, László Dobos, János Szüle, Norbert Barankai, Tamás Hanyecz, Tamás Sebők, Zsófia Kallus, Gábor Vattay

arXiv:1311.1169v1 | 16 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of extracting meaningful geospatial patterns from noisy social media data, with applications in text mining and regional analysis. The contribution is incremental: it adapts an existing method to a new domain.

The paper tackles the problem of identifying regional language features and topics in noisy geo-tagged Twitter data. It applies Robust PCA to separate outliers and highly localized topics from the low-dimensional structure of language use. On a dataset of over 200 million tweets, it identifies features that vary smoothly with geographic location.

Principal component analysis (PCA) and related techniques have been successfully employed in natural language processing. Text mining applications in the age of online social media (OSM) face new challenges due to properties specific to these use cases (e.g. spelling issues specific to texts posted by users, the presence of spammers and bots, service announcements, etc.). In this paper, we employ a Robust PCA technique to separate typical outliers and highly localized topics from the low-dimensional structure present in language use in online social networks. Our focus is on identifying geospatial features among the messages posted by the users of the Twitter microblogging service. Using a dataset which consists of over 200 million geolocated tweets collected over the course of a year, we investigate whether the information present in word usage frequencies can be used to identify regional features of language use and topics of interest. Using the PCA pursuit method, we are able to identify important low-dimensional features, which constitute smoothly varying functions of the geographic location.
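The abstract does not spell out the decomposition itself. As a rough illustration, a Robust PCA of this kind is commonly posed as splitting a data matrix M into a low-rank part L plus a sparse part S by minimizing the nuclear norm of L plus λ times the entrywise L1 norm of S, subject to M = L + S. The sketch below is a generic augmented-Lagrangian solver for that problem, not the authors' implementation. The function names, the default choices for `lam` and `mu`, and the reading of rows as regions and columns as words are all assumptions made here for illustration.

```python
import numpy as np


def soft_threshold(x, tau):
    """Entrywise shrinkage: move every value toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def svd_threshold(x, tau):
    """Shrink the singular values of x by tau (promotes low rank)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def robust_pca(m, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split m into low-rank + sparse parts (illustrative sketch).

    Defaults for lam and mu are common heuristics, assumed here;
    they are not taken from the paper.
    """
    m = np.asarray(m, dtype=float)
    n1, n2 = m.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(m).sum())

    low = np.zeros_like(m)
    sparse = np.zeros_like(m)
    dual = np.zeros_like(m)  # Lagrange multiplier for m = low + sparse
    norm_m = np.linalg.norm(m)

    for _ in range(max_iter):
        # Alternate between the low-rank and the sparse update.
        low = svd_threshold(m - sparse + dual / mu, 1.0 / mu)
        sparse = soft_threshold(m - low + dual / mu, lam / mu)
        residual = m - low - sparse
        dual = dual + mu * residual
        if np.linalg.norm(residual) <= tol * norm_m:
            break
    return low, sparse
```

In a setting like the one described, `m` might be a regions-by-words frequency matrix. The sparse part would then absorb outliers such as bot activity or highly localized topics, while the leading singular vectors of the low-rank part could be inspected for smooth geographic variation. That mapping is an interpretation of the abstract, not a description of the paper's pipeline.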
