All You Can Embed: Natural Language Based Vehicle Retrieval With Spatio-temporal Transformers
2021 · Carmelo Scribano, Davide Sapienza, Giorgia Franchini, et al.
Abstract
Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT, which provides an embedding of the textual descriptions, and (ii) a convolutional backbone combined with a Transformer model, which embeds the visual information. To train the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
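To make the training objective more concrete, the following is a minimal sketch of the standard Triplet Margin Loss applied to a text embedding and two track embeddings. It is not the variation proposed in the paper, whose details the abstract does not give. The function names (`l2_distance`, `triplet_margin_loss`, `rank_tracks`), the Euclidean distance, the margin value, and the toy 2-D vectors are all assumptions made for illustration; in the actual system the embeddings would come from BERT and from the convolutional backbone plus Transformer.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not the paper's variant).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows linearly with the gap.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def rank_tracks(text_emb, track_embs):
    """Return track indices sorted from closest to farthest from the text."""
    return sorted(
        range(len(track_embs)),
        key=lambda i: l2_distance(text_emb, track_embs[i]),
    )


# Toy embeddings (hypothetical values, chosen so distances are easy to check).
text = [0.0, 0.0]            # embedding of a natural language description
matching_track = [0.0, 1.0]  # distance 1 from the text embedding
other_track = [3.0, 4.0]     # distance 5 from the text embedding

well_separated = triplet_margin_loss(text, matching_track, other_track)
violated = triplet_margin_loss(text, other_track, matching_track)
ranking = rank_tracks(text, [other_track, matching_track])
```

At retrieval time only the learned distance is needed: a query description is embedded once, and candidate tracks are ranked by their distance to it, as `rank_tracks` does above.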
Related papers
- Connecting Language And Vision For Natural Language-based Vehicle Retrieval (2021)
- Symmetric Network With Spatial Relationship Modeling For Natural Language-based Vehicle Retrieval (2022)
- FindVehicle And VehicleFinder: A NER Dataset For Natural Language-based Vehicle Retrieval And A Keyword-based Cross-modal Vehicle Retrieval System (2023)
- Dual Embedding Expansion For Vehicle Re-identification (2020)
- BEV-TSR: Text-scene Retrieval In BEV Space For Autonomous Driving (2024)
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021)
- Multi-modal Transformer For Video Retrieval (2020)
- Learning Joint Representations Of Videos And Sentences With Web Image Search (2016)