CL · SD · AS · Apr 20, 2022

Cross-stitched Multi-modal Encoders

Amazon
arXiv:2204.09227v1 · h-index: 34
Originality: Incremental advance
AI Analysis

This paper addresses the problem of integrating speech and text input, offering researchers and practitioners in multi-modal AI a compact and resource-efficient solution.

The paper tackles multi-modal speech and text input by proposing a novel architecture that combines pretrained encoders using multi-headed cross-modal attention, enabling efficient token-level or utterance-level classification with improved capture of acoustic-prosodic and lexical information.

In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resultant architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech. The resultant encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
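To make the fusion idea in the abstract concrete, here is a minimal sketch of multi-headed cross-modal attention next to the concatenation baseline. It rests on several assumptions that the abstract does not confirm: text tokens act as queries and speech frames as keys and values, random matrices stand in for the outputs of the pretrained encoders, and residual connections, layer normalization, and the task head are left out. All function names and dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random stand-in for encoder outputs or projection weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attention(text, speech, wq, wk, wv, num_heads):
    """Assumed direction: text tokens (queries) attend over speech frames."""
    q, k, v = matmul(text, wq), matmul(speech, wk), matmul(speech, wv)
    d_head = len(q[0]) // num_heads
    fused = [[] for _ in text]
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [r[lo:hi] for r in q]
        kh_t = [list(c) for c in zip(*[r[lo:hi] for r in k])]
        vh = [r[lo:hi] for r in v]
        scores = matmul(qh, kh_t)  # shape: tokens x frames
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        for i, out_row in enumerate(matmul(weights, vh)):
            fused[i].extend(out_row)  # concatenate the heads per token
    return fused


def mean_pool(seq):
    """Average a sequence of vectors into one vector."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def concat_baseline(text, speech):
    """Baseline: pool each modality separately, then concatenate."""
    return mean_pool(text) + mean_pool(speech)


rng = random.Random(0)
d_text, d_speech, d_model, heads = 8, 6, 8, 2  # hypothetical sizes
text = rand_matrix(4, d_text, rng)        # 4 text tokens
speech = rand_matrix(10, d_speech, rng)   # 10 speech frames
wq = rand_matrix(d_text, d_model, rng)
wk = rand_matrix(d_speech, d_model, rng)
wv = rand_matrix(d_speech, d_model, rng)

fused = cross_attention(text, speech, wq, wk, wv, heads)  # one vector per token
utterance_vec = mean_pool(fused)                          # utterance-level input
baseline_vec = concat_baseline(text, speech)              # pre-pooled concatenation
```

In this sketch, `fused` keeps one vector per text token, which is the shape a token-level classifier would consume, while `utterance_vec` and `baseline_vec` are single vectors suited to utterance-level prediction. The baseline never lets an individual token see individual speech frames, which is the difference the comparison in the abstract targets.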

