CL · Feb 26, 2018

Did You Really Just Have a Heart Attack? Towards Robust Detection of Personal Health Mentions in Social Media

arXiv:1802.09130v22.872 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical task for public health monitoring and digital epidemiology: detecting health mentions in social media. The advance is incremental, since it builds on existing methods while improving robustness and data efficiency.

The paper tackles the detection of personal health mentions (PHM) in social media, a task made difficult by short posts, inventive language, and figurative usage. It proposes WESPAD, a method that combines multiple feature types and distorts the word embedding space to generalize from few examples. WESPAD outperforms the baselines and state-of-the-art methods, especially when training data is limited.

Millions of users share their experiences on social media sites, such as Twitter, which in turn generate valuable data for public health monitoring, digital epidemiology, and other analyses of population health at global scale. The first, critical task for these applications is classifying whether a personal health event was mentioned, which we call the Personal Health Mention (PHM) problem. This task is challenging for many reasons, including the typically short length of social media posts, inventive spelling and lexicons, and figurative language, such as hyperbole that uses diseases like "heart attack" or "cancer" for emphasis rather than as a health self-report. The problem is even more challenging for conditions that are rarely reported, or that are frequent but ambiguously expressed, such as "stroke". To address this problem, we propose a general, robust method for detecting PHMs in social media, which we call WESPAD, that combines lexical, syntactic, word embedding-based, and context-based features. WESPAD is able to generalize from few examples by automatically distorting the word embedding space to most effectively detect the true health mentions. Unlike previously proposed state-of-the-art supervised and deep-learning techniques, WESPAD requires relatively little training data, which makes it possible to adapt, with minimal effort, to each new disease and condition. We evaluate WESPAD both on an established, publicly available flu detection benchmark and on a new dataset that we have constructed with mentions of multiple health conditions. Our experiments show that WESPAD outperforms the baselines and state-of-the-art methods, especially when the number and proportion of true health mentions in the training data are small.
