The TUB Sign Language Corpus Collection
This collection addresses the scarcity of resources for sign language processing research and is aimed primarily at linguists and AI developers working on accessibility tools. The contribution is incremental in that it extends existing data collection methods.
The researchers tackled the scarcity of sign language data by creating a collection of parallel corpora for 12 sign languages in video format with subtitles. It comprises over 1,300 hours of video and 1.3 million subtitles containing 14 million tokens, and includes the first consistent corpora for 8 Latin American sign languages as well as a tenfold increase in German Sign Language data.
We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3 M subtitles containing 14 M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, and its German Sign Language corpora are ten times the size of those previously available. The collection was created by gathering and processing videos of multiple sign languages from various online sources, mainly broadcast material from news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.