CV · CL · Oct 27, 2022

Masked Vision-Language Transformer in Fashion

arXiv:2210.15110v1 · 27 citations · h-index: 191 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses multi-modal representation for the fashion domain, offering an incremental improvement over existing methods.

The paper tackles fashion-specific multi-modal representation by introducing a masked vision-language transformer (MVLT). MVLT replaces BERT with a vision transformer so that the model can be trained end to end on raw inputs, and it adds a masked image reconstruction (MIR) objective for fine-grained understanding. The authors report improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) over Kaleido-BERT, which they describe as the Fashion-Gen 2018 winner.

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize vision transformer architecture for replacing the BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
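The retrieval figure above is given as rank@5. As a minimal sketch of how a rank@K score is commonly computed (the share of queries whose ground-truth match appears among the K highest-scoring candidates), the snippet below uses a hypothetical helper `rank_at_k` and toy similarity scores. The function name, the data, and the candidate-set size are illustrative assumptions, not taken from the paper or its repository, and the abstract does not specify the evaluation protocol or whether the reported 17% is an absolute or relative gain.

```python
from typing import Sequence


def rank_at_k(
    scores: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int = 5,
) -> float:
    """Fraction of queries whose ground-truth candidate is in the top k.

    scores[q][c] is the similarity between query q and candidate c
    (higher means a better match); targets[q] is the index of the
    ground-truth candidate for query q. Hypothetical helper, for
    illustration only.
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy example: 3 queries, 3 candidates each (made-up numbers).
scores = [
    [0.9, 0.1, 0.3],  # ground truth 0 is ranked first
    [0.2, 0.8, 0.5],  # ground truth 2 is ranked second
    [0.7, 0.6, 0.1],  # ground truth 2 is ranked third
]
targets = [0, 2, 2]

for k in (1, 2, 3):
    print(f"rank@{k} = {rank_at_k(scores, targets, k):.3f}")
```

In a text-to-image retrieval setting, each row would hold the model's matching scores between one caption and a set of candidate images; the reverse direction swaps the roles of captions and images.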

Code Implementations: 1 repo
