CL SI Sep 22, 2015

A Review of Features for the Discrimination of Twitter Users: Application to the Prediction of Offline Influence

Jean-Valère Cossu, Vincent Labatut, Nicolas Dugué

arXiv:1509.06585v32.239 citationsHas Code

Originality Synthesis-oriented

AI Analysis

This work addresses the challenge of feature selection for Twitter user classification, specifically for predicting offline influence. Its contribution is incremental: it builds on existing literature, adding a new typology and an application.

The authors tackled the problem of selecting appropriate features for classifying Twitter users by reviewing and unifying heterogeneous features from the literature, then applied them to predict offline influence. They found that traditional online influence features like retweets and followers were not relevant, but their content-based approaches outperformed state-of-the-art methods on the CLEF RepLab 2014 dataset.

Many works related to Twitter aim at characterizing its users in some way: role on the service (spammers, bots, organizations, etc.), nature of the user (socio-professional category, age, etc.), topics of interest, and others. However, for a given user classification problem, it is very difficult to select a set of appropriate features, because the many features described in the literature are very heterogeneous, with name overlaps and collisions, and numerous very close variants. In this article, we review a wide range of such features. In order to present a clear state-of-the-art description, we unify their names, definitions and relationships, and we propose a new, neutral typology. We then illustrate the interest of our review by applying a selection of these features to the offline influence detection problem. This task consists in identifying users who are influential in real life, based on their Twitter account and related data. We show that most features deemed efficient to predict online influence, such as the numbers of retweets and followers, are not relevant to this problem. However, we propose several content-based approaches to label Twitter users as Influencers or not. We also rank them according to a predicted influence level. Our proposals are evaluated over the CLEF RepLab 2014 dataset, and outperform state-of-the-art methods.
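The abstract does not say how the content-based approaches work, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each user is represented by a bag of words built from their tweets, that each class has a profile aggregated from labeled training users, and that cosine similarity to those profiles gives both a label and a ranking score. All names (`class_profiles`, `influence_score`, `label_and_rank`) and the toy data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep word-like tokens, including #hashtags and @mentions.
    return re.findall(r"[a-z0-9#@']+", text.lower())


def bag_of_words(tweets):
    # Merge all of a user's tweets into a single term-count vector.
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_profiles(train):
    # train: list of (tweets, label); one aggregated term vector per label.
    profiles = {}
    for tweets, label in train:
        profiles.setdefault(label, Counter()).update(bag_of_words(tweets))
    return profiles


def influence_score(tweets, profiles):
    # Positive when the user's content is closer to the influencer profile.
    bow = bag_of_words(tweets)
    return cosine(bow, profiles["influencer"]) - cosine(bow, profiles["other"])


def label_and_rank(users, profiles):
    # Sort users by score (highest first) and attach a binary label.
    scored = [(name, influence_score(tweets, profiles))
              for name, tweets in users.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, "influencer" if s > 0 else "other", s) for name, s in scored]


# Toy, made-up data for illustration only (not from RepLab 2014).
TRAIN = [
    (["Our quarterly report on banking regulation is out",
      "Speaking at the finance summit tomorrow"], "influencer"),
    (["lol traffic again", "need coffee so bad"], "other"),
]
USERS = {
    "user_a": ["New analysis of banking regulation and finance policy"],
    "user_b": ["coffee and traffic, what a morning lol"],
}

if __name__ == "__main__":
    for name, label, score in label_and_rank(USERS, class_profiles(TRAIN)):
        print(name, label, round(score, 3))
```

A single score serves both tasks named in the abstract: its sign gives the Influencer / non-Influencer label, and its magnitude gives the ranking. A real system would likely use weighted terms (for example TF-IDF) and far more training data.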
