A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer
This work addresses the need for more diverse and extensive datasets in video text spotting, particularly for bilingual and open-world scenarios, though it is incremental in combining existing Transformer methods with new data.
The authors tackled the problem of limited video text spotting benchmarks by introducing BOVText, a large-scale bilingual dataset with 2,000+ videos and 1.75M+ frames, and proposed TransVTSpotter, an end-to-end Transformer-based framework that achieved state-of-the-art performance with 44.1% MOTA at 9 fps on ICDAR2015(video).
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset(BOVText). There are four features for BOVText. Firstly, we provide 2,000+ videos with more than 1,750,000+ frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 30+ open categories with a wide selection of various scenarios, e.g., Life Vlog, Driving, Movie, etc. Thirdly, abundant text types annotation (i.e., title, caption or scene text) are provided for the different representational meanings in video. Fourthly, the BOVText provides bilingual text annotation to promote multiple cultures live and communication. Besides, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which solves the multi-orient text spotting in video with a simple, but efficient attention-based query-key mechanism. It applies object features from the previous frame as a tracking query for the current frame and introduces a rotation angle prediction to fit the multiorient text instance. On ICDAR2015(video), TransVTSpotter achieves the state-of-the-art performance with 44.1% MOTA, 9 fps. The dataset and code of TransVTSpotter can be found at github:com=weijiawu=BOVText and github:com=weijiawu=TransVTSpotter, respectively.