Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization
This work addresses the grounding of generated video descriptions, that is, checking that the objects a caption mentions can be localized in the video. It offers an incremental improvement over previous methods that train caption generation and object grounding jointly.
The paper tackles the Entities Object Localization problem by separating caption generation and object grounding into two stages. It improves the first with a unified multi-modal pre-training model and the second with a fine-tuned MDETR detector plus post-processing. The system achieves state-of-the-art results at ActivityNet 2021: 72.57 localization accuracy on the sub-task I testing set and 0.2477 F1_all_per_sent on the hidden sub-task II testing set.
Entities Object Localization (EOL) aims to evaluate how grounded, or faithful, a description is. The task consists of two parts: caption generation and object grounding. Previous works tackle this problem by jointly training the two modules in a single framework, which limits the complexity of each module. In this work, we therefore propose to split the two modules into two stages and improve each separately to boost the performance of the whole system. For caption generation, we propose a Unified Multi-modal Pre-training Model (UMPM) that generates event descriptions rich in objects, which benefits localization. For object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post-processing method to make the grounding results more faithful. Our overall system achieves state-of-the-art performance on both sub-tasks of the Entities Object Localization challenge at ActivityNet 2021, with 72.57 localization accuracy on the testing set of sub-task I and 0.2477 F1_all_per_sent on the hidden testing set of sub-task II.
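The two-stage decoupling described above can be illustrated with a minimal sketch. This is not the authors' implementation: `two_stage_eol`, `Grounding`, the `captioner` and `grounder` callables, and the `min_score` threshold are all hypothetical names introduced here for illustration. The post-processing rule shown (drop low-confidence boxes, keep the best box per phrase) is only an assumed stand-in, since the abstract does not specify the actual method.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


@dataclass
class Grounding:
    phrase: str   # object phrase taken from the caption
    box: Box      # predicted region for that phrase
    score: float  # grounder confidence in [0, 1]


def two_stage_eol(
    video: Any,
    captioner: Callable[[Any], str],
    grounder: Callable[[Any, str], List[Grounding]],
    min_score: float = 0.5,  # hypothetical threshold, not from the report
) -> Tuple[str, List[Grounding]]:
    """Run captioning and grounding as two independent stages."""
    # Stage 1: generate the description (UMPM plays this role in the report).
    caption = captioner(video)

    # Stage 2: ground the caption's phrases (a fine-tuned MDETR in the report).
    candidates = grounder(video, caption)

    # Assumed post-processing: discard low-confidence boxes and keep
    # only the highest-scoring box for each phrase.
    best: Dict[str, Grounding] = {}
    for g in candidates:
        if g.score < min_score:
            continue
        if g.phrase not in best or g.score > best[g.phrase].score:
            best[g.phrase] = g
    return caption, list(best.values())
```

Because each stage is passed in as a plain callable, either module can be replaced or scaled up without retraining the other, which is the motivation the abstract gives for splitting the system.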