CLJul 12, 2017

N-GrAM: New Groningen Author-profiling Model

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, Malvina Nissim

arXiv:1707.03764v12.791 citations

Originality Synthesis-oriented

AI Analysis

This work addresses author profiling for forensic or social media analysis, but it is incremental as it applies existing methods to a shared task.

The authors tackled author profiling by creating a single model to identify gender and language variety across four languages, achieving an average accuracy of 0.86 with sub-task performance ranging from 0.68 to 0.98 using a linear SVM with word unigrams and character n-grams.

We describe our participation in the PAN 2017 shared task on Author Profiling, identifying authors' gender and language variety for English, Spanish, Arabic and Portuguese. We describe both the final, submitted system, and a series of negative results. Our aim was to create a single model for both gender and language, and for all language varieties. Our best-performing system (on cross-validated results) is a linear support vector machine (SVM) with word unigrams and character 3- to 5-grams as features. A set of additional features, including POS tags, additional datasets, geographic entities, and Twitter handles, hurt, rather than improve, performance. Results from cross-validation indicated high performance overall and results on the test set confirmed them, at 0.86 averaged accuracy, with performance on sub-tasks ranging from 0.68 to 0.98.

View on arXiv PDF

Similar