Towards Large-Scale Data Mining for Data-Driven Analysis of Sign Languages
This work addresses data scarcity for sign language researchers, though the contribution is incremental: it applies existing data mining techniques to a new domain.
The authors address the shortage of sign language data by developing a pipeline that collects and filters data from social media platforms such as TikTok, Instagram, and YouTube, enabling analysis of American and Brazilian Sign Languages with a focus on phonological parameters.
Access to sign language data is far from adequate. We show that it is possible to collect such data from social networking services such as TikTok, Instagram, and YouTube by applying data filtering to enforce quality standards and by discovering patterns in the filtered data, making it easier to analyse and model. Using our data collection pipeline, we collect and examine the interpretation of songs in both American Sign Language (ASL) and Brazilian Sign Language (Libras). We explore their differences and similarities by looking at the co-dependence of the orientation and location phonological parameters.
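
One way to make the co-dependence of two phonological parameters concrete is to compute the mutual information between their labels over a set of annotated signs. The sketch below is illustrative only: the abstract does not state which measure is used, and the function name `mutual_information` and the label values (`palm_up`, `chest`, and so on) are hypothetical placeholders rather than the authors' annotation scheme.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information, in bits, between two categorical labels.

    `pairs` is a list of (orientation, location) tuples, one per sign.
    This is an assumed stand-in measure, not the paper's stated method.
    """
    n = len(pairs)
    joint = Counter(pairs)                          # counts of (orientation, location)
    orientations = Counter(o for o, _ in pairs)     # marginal counts, orientation
    locations = Counter(loc for _, loc in pairs)    # marginal counts, location

    mi = 0.0
    for (o, loc), count in joint.items():
        p_joint = count / n
        # p(o, loc) / (p(o) * p(loc)) simplifies to count * n / (n_o * n_loc)
        ratio = count * n / (orientations[o] * locations[loc])
        mi += p_joint * log2(ratio)
    return mi


# Hypothetical labels: orientation fully determines location here.
dependent = [("palm_up", "chest"), ("palm_down", "head")] * 2

# Hypothetical labels: every orientation occurs equally with every location.
independent = [
    ("palm_up", "chest"), ("palm_up", "head"),
    ("palm_down", "chest"), ("palm_down", "head"),
]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

A value near zero suggests the two parameters vary independently in the sample, while larger values suggest that knowing one parameter narrows down the other. Comparing such a score between ASL and Libras samples would be one possible way to contrast the two languages, under the assumptions stated above.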