CV · Jun 18, 2021

All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Carmelo Scribano, Davide Sapienza, Giorgia Franchini, Micaela Verucchi, Marko Bertogna

arXiv:2106.10153v12.66 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses vehicle retrieval for smart-city applications. Its contribution is incremental: it combines existing components such as BERT and Transformers rather than introducing new ones.

The paper tackles the problem of retrieving vehicles using natural language descriptions by combining visual and textual information, achieving competitive performance on the AI City Challenge Track 5 benchmark.

Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT, which provides an embedding of the textual descriptions, and (ii) a convolutional backbone combined with a Transformer model, which embeds the visual information. To train the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
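To make the training objective concrete, the following is a minimal sketch of the standard Triplet Margin Loss applied to a text embedding (anchor) and two track embeddings (a matching one and a non-matching one). It is not the variation proposed in the paper, whose details are not given in the abstract. The function names, the Euclidean distance, and the default margin of 1.0 are all assumptions made for illustration; plain Python lists stand in for the BERT and Transformer outputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not the paper's variant).

    anchor:   text embedding of a description
    positive: visual embedding of the matching track
    negative: visual embedding of a non-matching track
    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def rank_tracks(text_emb, track_embs):
    """Return track indices sorted by increasing distance to the text embedding."""
    return sorted(range(len(track_embs)),
                  key=lambda i: euclidean(text_emb, track_embs[i]))
```

At retrieval time, under the same assumptions, a learned distance of this kind would be used as in `rank_tracks`: the query description is embedded once and candidate tracks are ordered by their distance to it.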
