CV | Dec 11, 2023

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao

arXiv:2312.06630v33.93 citations | h-index: 16 | Has Code | NIPS

Originality: Incremental advance

AI Analysis

This work addresses the high cost and limited scale of annotated datasets for video instance segmentation by enabling effective joint training across domain-specific datasets, though its approach to handling taxonomy heterogeneity is incremental.

The paper tackles the challenge of training video instance segmentation models across multiple isolated datasets with heterogeneous category spaces by proposing TMT-VIS, a taxonomy-aware joint training method that incorporates taxonomy information to improve classification precision. The model achieves state-of-the-art results on four benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO, demonstrating significant improvements over baseline solutions.
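To make "heterogeneous category spaces" concrete, here is a minimal sketch of merging per-dataset category lists into one label space before joint training. It is an illustration only, not the authors' code: the function name `build_unified_taxonomy`, the dataset names, the abbreviated category lists, and the rule of matching categories by lowercased name are all assumptions.

```python
from typing import Dict, List, Tuple


def build_unified_taxonomy(
    datasets: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, Dict[int, int]]]:
    """Merge per-dataset category lists into one label space.

    Returns the unified category list and, for each dataset, a map from
    dataset-local class index to unified class index.
    """
    unified: List[str] = []
    index_of: Dict[str, int] = {}
    local_to_unified: Dict[str, Dict[int, int]] = {}
    for name, categories in datasets.items():
        mapping: Dict[int, int] = {}
        for local_id, category in enumerate(categories):
            # Assumption: two categories are the same class if their
            # names match after trimming and lowercasing.
            key = category.strip().lower()
            if key not in index_of:
                index_of[key] = len(unified)
                unified.append(key)
            mapping[local_id] = index_of[key]
        local_to_unified[name] = mapping
    return unified, local_to_unified


# Hypothetical, abbreviated category lists (not the real benchmark taxonomies).
datasets = {
    "dataset_a": ["person", "dog", "car"],
    "dataset_b": ["Person", "bird", "car"],
}
unified, local_to_unified = build_unified_taxonomy(datasets)
```

With a shared index space like this, ground-truth labels from any dataset can be remapped before computing a classification loss. The unified space is larger than any single dataset's, which is the setting in which the paper argues a model's attention gets diluted.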

Training on large-scale datasets can boost the performance of video instance segmentation (VIS), but annotated VIS datasets are hard to scale up because of the high labor cost. What we do have are numerous isolated, field-specific datasets, so it is appealing to train models jointly on their aggregation to increase data volume and diversity. However, because the category spaces are heterogeneous, simply combining multiple datasets dilutes the model's attention across different taxonomies, even though mask precision increases with data volume. It is therefore important to increase the data scale and enrich the taxonomy space while also improving classification precision. In this work, we find that providing extra taxonomy information helps models concentrate on a specific taxonomy, and we propose Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation (TMT-VIS) to address this challenge. Specifically, we design a two-stage taxonomy aggregation module that first compiles taxonomy information from input videos and then aggregates these taxonomy priors into instance queries before the transformer decoder. We conduct extensive experimental evaluations on four popular and challenging benchmarks: YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions and sets new state-of-the-art records on all four benchmarks. These results demonstrate the effectiveness and generality of our approach. The code is available at https://github.com/rkzheng99/TMT-VIS.
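The two-stage flow described in the abstract (compile taxonomy information from the video, then aggregate it into instance queries before the decoder) can be pictured with a toy sketch. The abstract does not specify the scoring function, the selection rule, or the form of the aggregation, so everything here is an assumption made for illustration: dot-product scoring, top-k selection, single-head attention with a residual connection, and the names `compile_taxonomy` and `aggregate_into_queries`. The authors' actual implementation is in their repository.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compile_taxonomy(
    video_feature: Vector, category_embeddings: List[Vector], top_k: int
) -> List[int]:
    """Stage 1 (assumed form): pick the top_k categories whose embeddings
    score highest against a pooled video feature."""
    scores = [dot(video_feature, emb) for emb in category_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def aggregate_into_queries(
    queries: List[Vector], priors: List[Vector]
) -> List[Vector]:
    """Stage 2 (assumed form): each instance query attends over the selected
    taxonomy priors, and the attended vector is added back to the query."""
    scale = math.sqrt(len(priors[0]))
    updated = []
    for query in queries:
        weights = softmax([dot(query, prior) / scale for prior in priors])
        attended = [
            sum(w * prior[d] for w, prior in zip(weights, priors))
            for d in range(len(query))
        ]
        updated.append([q + a for q, a in zip(query, attended)])
    return updated


# Toy 2-D embeddings for three hypothetical categories.
category_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video_feature = [0.9, 0.2]

selected = compile_taxonomy(video_feature, category_embeddings, top_k=2)
priors = [category_embeddings[i] for i in selected]

queries = [[0.0, 0.0], [1.0, 0.0]]
taxonomy_aware_queries = aggregate_into_queries(queries, priors)
```

In this sketch the third category is filtered out in stage 1, so the queries handed to the decoder carry information only about categories judged likely to appear in the video. That is one way to read the abstract's claim that taxonomy priors help the model concentrate on a specific taxonomy.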
