The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
The COUGHVID dataset gives researchers developing cough analysis algorithms a large-scale, expert-labeled corpus, addressing an urgent need during global health crises such as COVID-19. The contribution is incremental in that it builds on existing data collection efforts.
The authors address the lack of a validated database for training machine learning models for cough audio classification, particularly for COVID-19 screening. They do so by creating the COUGHVID dataset, which contains over 20,000 crowdsourced cough recordings, more than 2,000 of which carry expert labels.
Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. However, there is currently no validated database of cough sounds with which to train such ML models. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. First, we filtered the dataset using our open-sourced cough detection algorithm. Second, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates, and that their expert labels are consistent. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.
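The filtering step described above can be illustrated with a minimal sketch. It assumes each recording already carries a score in [0, 1] from some cough detector, and it keeps only recordings at or above a cutoff. The `Recording` class, its field names, and the 0.8 default threshold are hypothetical choices made for illustration; they are not taken from the dataset's actual schema or from the authors' open-sourced algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recording:
    # All field names are hypothetical, not the dataset's actual schema.
    recording_id: str
    cough_score: float            # assumed detector output in [0, 1]
    status: Optional[str] = None  # assumed self-reported status label


def filter_coughs(recordings: List[Recording], threshold: float = 0.8) -> List[Recording]:
    """Keep recordings whose detector score meets the (assumed) threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [r for r in recordings if r.cough_score >= threshold]


# Toy records for illustration only; these are not real dataset entries.
sample = [
    Recording("a", 0.95, "COVID-19"),
    Recording("b", 0.10, "healthy"),
    Recording("c", 0.85, "symptomatic"),
]
kept = filter_coughs(sample)
print([r.recording_id for r in kept])
```

Raising the threshold trades dataset size for confidence that each retained clip actually contains a cough; the appropriate value depends on the detector and on the downstream classification task.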