CV, AI · Mar 10, 2023

CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

arXiv:2303.05725v4 · 107 citations · h-index: 44
Originality: Highly original
AI Analysis

This work addresses sign language recognition, a weakly supervised task, by improving cross-modal alignment, offering a competitive alternative to complex multi-cue methods.

The paper tackles the problem of insufficient training data for sign language recognition by proposing CVT-SLR, a method that combines a variational autoencoder with contrastive alignment to leverage pretrained visual and language knowledge, achieving state-of-the-art results on the PHOENIX-2014 and PHOENIX-2014T datasets.

Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign datasets, is the main bottleneck for SLR. Most SLR works therefore adopt pretrained visual modules and follow one of two mainstream approaches. Multi-stream architectures extend multi-cue visual features; they yield the current SOTA performance but require complex designs and may introduce noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
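To make the idea of an explicit contrastive consistency constraint between modalities concrete, here is a minimal sketch in plain Python. The abstract does not give the exact loss used in CVT-SLR, so this assumes a generic symmetric InfoNCE-style objective over paired embeddings. The function name `contrastive_alignment_loss`, the `temperature` value, and the toy vectors are all illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over one row of scores.
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def contrastive_alignment_loss(visual, textual, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed stand-in, not the paper's loss).

    visual[i] and textual[i] are treated as a matched pair; every other
    combination in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)
    # Cosine similarity matrix, sharpened by the temperature.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Visual -> textual direction: each row should peak on the diagonal.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Textual -> visual direction: same check on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

# Toy embeddings (hypothetical): matched pairs point in similar directions.
aligned = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.9, 0.1], [0.1, 0.9]])
# Same vectors with the textual side swapped, so pairs are mismatched.
swapped = contrastive_alignment_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when each visual embedding is closest to its own textual counterpart and grows when the pairing is scrambled, which is the behaviour a consistency constraint of this kind is meant to reward. In a real system the embeddings would come from the visual and language modules rather than hand-written lists.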

Code Implementations: 1 repo