CL | Dec 8, 2025

Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

arXiv:2512.08094v1 | h-index: 16
AI Analysis

Segment, Embed, and Align (SEA) provides a universal solution for generating high-quality parallel data to advance sign language processing, addressing a domain-specific bottleneck.

The authors tackled the problem of aligning subtitles to continuous sign language videos across multiple languages and domains, achieving state-of-the-art alignment performance on four datasets with a method that runs efficiently on CPUs within a minute for hour-long episodes.

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method, Segment, Embed, and Align (SEA), provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first segments a video frame sequence into individual signs, and the second embeds the video clip of each sign into a latent space shared with text. Alignment is then performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
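The abstract does not spell out the dynamic programming step, so the following is only a minimal sketch of one way a monotonic alignment over embeddings could look, not the authors' implementation. It assumes the segmentation and embedding models have already produced one vector per sign and one per subtitle; the function names `align` and `subtitle_spans`, the cosine scoring, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(sign_embs, text_embs):
    """Assign each sign to one subtitle index.

    Assumption (not from the paper): signs and subtitles are both in temporal
    order, so the assignment must be non-decreasing, and we maximise the sum
    of cosine similarities. Runs in O(n * m) time.
    """
    n, m = len(sign_embs), len(text_embs)
    if n == 0 or m == 0:
        return []
    sim = [[cosine(s, t) for t in text_embs] for s in sign_embs]
    score = [[0.0] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    score[0] = sim[0][:]
    for i in range(1, n):
        best_k = 0  # argmax of score[i - 1][k] over k <= j
        for j in range(m):
            if score[i - 1][j] > score[i - 1][best_k]:
                best_k = j
            score[i][j] = sim[i][j] + score[i - 1][best_k]
            back[i][j] = best_k
    # Backtrack from the best final subtitle index.
    j = max(range(m), key=lambda c: score[n - 1][c])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path


def subtitle_spans(path, sign_times):
    """Turn a sign-to-subtitle assignment into (start, end) times per subtitle."""
    spans = {}
    for j, (start, end) in zip(path, sign_times):
        if j in spans:
            spans[j] = (min(spans[j][0], start), max(spans[j][1], end))
        else:
            spans[j] = (start, end)
    return spans


# Toy data: four signs, two subtitles, 2-D embeddings (purely illustrative).
signs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
texts = [[1.0, 0.0], [0.0, 1.0]]
times = [(0.0, 0.5), (0.5, 1.0), (1.2, 1.8), (1.8, 2.4)]

path = align(signs, texts)
print(path)                         # [0, 0, 1, 1]
print(subtitle_spans(path, times))  # {0: (0.0, 1.0), 1: (1.2, 2.4)}
```

Because the table has one cell per sign and subtitle pair and each cell is filled in constant time, this kind of procedure stays cheap on a CPU, which is consistent with the runtime the abstract reports. A subtitle that receives no sign is simply absent from the returned spans in this sketch.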

