Mining Hidden Populations through Attributed Search
This work addresses a challenge faced by researchers and data scientists: efficiently sampling target populations on social media platforms when those populations cannot be queried directly. It represents an incremental improvement over existing methods.
The paper tackles the problem of efficiently discovering hidden populations on social networks when their defining property is not directly queryable via APIs. It proposes a Decision tree-based Thompson sampler (DT-TMP) that exploits attribute correlations and a hierarchical ordering of the query space. In online experiments it outperforms state-of-the-art samplers, for example by 54% on Twitter; in offline experiments where the number of entities matching a query is known, it improves on the baseline samplers by a factor of 0.9-1.5×.
Researchers often query online social platforms through their application programming interfaces (APIs) to find target populations such as people with mental illness~\cite{De-Choudhury2017} and jazz musicians~\cite{heckathorn2001finding}. Entities in such a target population satisfy a property that is typically identified using an oracle (a human or a pre-trained classifier). When the property of the target entities is not directly queryable via the API, we refer to the property as `hidden' and to the population as a hidden population. Finding individuals who belong to these populations on social networks is hard because the property is non-queryable, and the sampler has to explore a combinatorial query space within a finite budget. We propose a Decision tree-based Thompson sampler (\texttt{DT-TMP}) that exploits the correlation between queryable attributes and the population of interest and hierarchically orders the query space to efficiently discover the right combination of attributes to query. Our proposed sampler outperforms the state-of-the-art samplers in online experiments, for example by 54\% on Twitter. In offline experiments where the number of entities matching a query is known, \texttt{DT-TMP} improves on the baseline samplers by a factor of 0.9-1.5$\times$. In future work, we wish to explore finding hidden populations by formulating more complex queries.
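
To make the sampling idea concrete, the following is a minimal sketch of plain Beta-Bernoulli Thompson sampling over a fixed, flat set of attribute queries. It is not \texttt{DT-TMP}: the decision-tree ordering of the query space is omitted, and the function name, the query labels, and the simulated hit rates (which stand in for the API and the oracle) are hypothetical assumptions made only for illustration.

```python
import random


def thompson_sample_queries(hit_rates, budget, seed=0):
    """Flat Beta-Bernoulli Thompson sampling over a fixed set of queries.

    hit_rates is a hypothetical dict mapping each query (an attribute
    combination) to the probability that an entity it returns belongs to
    the hidden population. A real sampler does not know these values; it
    only sees the oracle's label for each returned entity.
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior per query, stored as [alpha, beta].
    posterior = {query: [1, 1] for query in hit_rates}
    found = 0
    for _ in range(budget):
        # Draw one plausible hit rate per query and issue the best-looking one.
        query = max(posterior, key=lambda q: rng.betavariate(*posterior[q]))
        # Simulated API call plus oracle label (assumption: one entity per query).
        is_target = rng.random() < hit_rates[query]
        # Update alpha on a hit, beta on a miss.
        posterior[query][0 if is_target else 1] += 1
        found += int(is_target)
    return found, posterior


# Hypothetical queries with made-up hit rates, for illustration only.
example_rates = {
    ("location=A",): 0.05,
    ("location=A", "keyword=jazz"): 0.90,
    ("keyword=music",): 0.05,
}
found, posterior = thompson_sample_queries(example_rates, budget=300)
# Number of times each query was issued: posterior counts minus the prior.
pulls = {q: a + b - 2 for q, (a, b) in posterior.items()}
```

In this sketch the budget is spent one entity at a time, and queries whose sampled hit rate is high are issued more often as evidence accumulates, which is the explore-exploit behaviour a Thompson sampler relies on.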