Cosine meets Softmax: A tough-to-beat baseline for visual grounding
This work addresses the problem of accurately linking text to visual regions for autonomous driving systems, showing that simpler methods can outperform complex approaches, though it is incremental in nature.
The paper tackles visual grounding for autonomous driving by proposing a simple baseline that uses cosine distance and cross-entropy loss, achieving 68.7% AP50 accuracy on the Talk2Car dataset, which improves upon the previous state of the art by 8.6%.
In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms state-of-the-art methods while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks to obtain the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. Our results show promise in simpler alternatives and suggest reconsidering approaches that employ sophisticated attention mechanisms, multi-stage reasoning, or complex metric learning loss functions.
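The objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the ROI features and the text embedding have already been mapped to the same dimension (the learned transformation layer and the pre-trained networks are omitted), and it assumes cosine similarity is used as the per-region score fed to the softmax. The function names and the `scale` parameter are hypothetical and do not come from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero-length vectors (assumption: treat them as tiny norms).
    return dot / (max(norm_u, 1e-12) * max(norm_v, 1e-12))

def region_log_probs(roi_features, text_embedding, scale=1.0):
    """Log-softmax over the cosine scores of all candidate regions.

    `scale` is a hypothetical temperature-like factor, not taken from the paper.
    """
    scores = [scale * cosine_similarity(r, text_embedding) for r in roi_features]
    # Numerically stable log-sum-exp.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def grounding_loss(roi_features, text_embedding, target_index, scale=1.0):
    """Cross-entropy loss: negative log-probability of the ground-truth region."""
    return -region_log_probs(roi_features, text_embedding, scale)[target_index]

def predict_region(roi_features, text_embedding):
    """Index of the region whose feature is most similar to the text embedding."""
    scores = [cosine_similarity(r, text_embedding) for r in roi_features]
    return max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    # Toy 2-D features for three candidate regions and one command embedding.
    rois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [2.0, 0.0]
    print("predicted region:", predict_region(rois, text))
    print("loss if region 0 is correct:", grounding_loss(rois, text, 0))
```

In this sketch, training would amount to minimizing `grounding_loss` with respect to the parameters of the text transformation layer, and inference reduces to picking the highest-scoring region.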