CL, CY, SI | Apr 9, 2020

PANDORA Talks: Personality and Demographics on Reddit

Matej Gjurković, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, Jan Šnajder

arXiv:2004.04460v3

Originality: Synthesis-oriented

AI Analysis

The dataset fills a gap for researchers in NLP and the social sciences by supporting work on interpretability and the removal of societal biases. The contribution is incremental, since it centers on data collection rather than on novel methods.

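To address the scarcity of datasets with both personality and demographic labels, the authors introduce PANDORA, a large-scale dataset of Reddit comments labeled with three personality models and demographics for more than 10k users. They demonstrate its usefulness in three experiments: predicting Big 5 traits from data labeled with other personality models, analyzing gender classification biases, and carrying out a confirmatory and exploratory analysis based on psychological theories. They also provide benchmark prediction models for all personality and demographic variables.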

Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.
