SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages

Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

arXiv:2605.01720

AI Analysis

This dataset addresses the lack of a large-scale, pose-native resource for sign language processing, with the aim of supporting open-world recognition and generation that is more robust than RGB-based methods.

SignVerse-2M is a pose-native dataset of about two million clips covering more than 25 sign languages. It converts raw videos into unified DWPose sequences to support open-world sign language recognition and generation. The dataset supports multilingual pose-space modeling and is compatible with modern pose-driven pipelines.

Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks, for two reasons. (1) RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. (2) Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios.

We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. To this end, we also provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline. These demonstrate the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines. We also discuss the evaluation claims the resource can support and its current limitations.
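To make the idea of a "2D pose sequence" concrete, the following is a minimal sketch of how one clip might be held in memory and normalized. It is an illustration under stated assumptions, not the dataset's actual schema or the authors' preprocessing. It assumes the 133-keypoint COCO-WholeBody layout commonly used with DWPose, and the `PoseClip` container, its field names, and the shoulder-based normalization are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: COCO-WholeBody layout, 133 keypoints per frame
# (17 body, 6 foot, 68 face, 42 hand). The released files may differ.
NUM_KEYPOINTS = 133
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6  # COCO body keypoint indices

Keypoint = Tuple[float, float, float]  # (x, y, confidence)


@dataclass
class PoseClip:
    """Hypothetical container for one clip; not the dataset's real schema."""
    language: str                 # illustrative sign-language tag
    frames: List[List[Keypoint]]  # frames[t][k] = keypoint k at time t


def normalize_frame(frame: List[Keypoint]) -> List[Keypoint]:
    """Center on the shoulder midpoint and scale by shoulder width.

    This is one common way to reduce dependence on camera distance and
    framing; it is an assumption here, not a step stated in the abstract.
    """
    lx, ly, _ = frame[LEFT_SHOULDER]
    rx, ry, _ = frame[RIGHT_SHOULDER]
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    width = ((lx - rx) ** 2 + (ly - ry) ** 2) ** 0.5
    scale = width if width > 1e-6 else 1.0  # avoid division by zero
    return [((x - cx) / scale, (y - cy) / scale, c) for x, y, c in frame]


def normalize_clip(clip: PoseClip) -> PoseClip:
    """Apply per-frame normalization and return a new clip."""
    return PoseClip(clip.language, [normalize_frame(f) for f in clip.frames])
```

A sequence model such as the baseline mentioned above would then consume `frames` as a time-ordered list of keypoint vectors, with no RGB pixels involved.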
