CV · May 17, 2020

T-VSE: Transformer-Based Visual Semantic Embedding

arXiv:2005.08399v1 · 7 citations
Originality: Incremental advance
AI Analysis

This work addresses cross-modal retrieval for e-commerce applications, representing an incremental improvement over existing methods.

The paper tackles cross-modal image/text search, showing that transformer-based embeddings outperform simpler models such as word averaging and RNNs by a large margin when trained on a large e-commerce dataset of product image-title pairs.

Transformer models have recently achieved impressive performance on NLP tasks, owing to new algorithms for self-supervised pre-training on very large text corpora. In contrast, recent literature suggests that simple average word models outperform more complicated language models, e.g., RNNs and Transformers, on cross-modal image/text search tasks on standard benchmarks, like MS COCO. In this paper, we show that dataset scale and training strategy are critical and demonstrate that transformer-based cross-modal embeddings outperform word average and RNN-based embeddings by a large margin, when trained on a large dataset of e-commerce product image-title pairs.
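To make the retrieval setting concrete, here is a minimal sketch of the word-average baseline that the abstract compares against: a title is embedded as the mean of its word vectors, and images are ranked by cosine similarity in the shared space. The function names, the 2-dimensional vectors, and the image ids are all hypothetical and chosen for illustration; this is not the paper's model or training procedure, and a transformer-based encoder would replace `word_average` with a learned text encoder.

```python
import math


def word_average(tokens, word_vectors):
    """Embed a title as the mean of its known word vectors (the 'word average' baseline)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known tokens in the query")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(query_vec, image_vecs):
    """Return image ids sorted by descending cosine similarity to the query."""
    return sorted(
        image_vecs,
        key=lambda image_id: cosine(query_vec, image_vecs[image_id]),
        reverse=True,
    )


# Toy, made-up embeddings (assumption: text and images already share one space).
word_vectors = {"red": [1.0, 0.0], "dress": [0.8, 0.2], "laptop": [0.0, 1.0]}
image_vecs = {"img_dress": [0.9, 0.1], "img_laptop": [0.1, 0.9]}

query = word_average("red dress".split(), word_vectors)
ranking = rank_images(query, image_vecs)
print(ranking)
```

In a real system both encoders would be trained jointly on image-title pairs so that matching pairs land close together; the sketch only shows the retrieval step once such embeddings exist.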

