VKIE: The Application of Key Information Extraction on Video Text
This work addresses a domain-specific task for industry applications in video analysis, but it appears incremental as it builds on existing methods for key information extraction.
The paper tackles the problem of extracting hierarchical key information from visual texts in videos by proposing two solutions, PipVKIE and UniVKIE, which achieve remarkable performance and efficient inference speed on a well-defined dataset.
Extracting structured information from videos is critical for numerous downstream applications in industry. In this paper, we define the task of extracting hierarchical key information from visual text in videos. To address this task, we decouple it into four subtasks and introduce two implementation solutions, PipVKIE and UniVKIE. PipVKIE completes the four subtasks sequentially in consecutive stages, while UniVKIE improves on it by unifying all the subtasks in one backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions achieve remarkable performance and efficient inference speed.
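To make the multimodal feature representation concrete, the following is a minimal sketch of how vision, text, and coordinate features for one detected text region might be combined. It is an illustration only: the abstract does not specify the fusion method, so plain concatenation is an assumption, and the names TextBox, normalize_box, and fuse are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextBox:
    """One detected text region in a video frame (hypothetical structure)."""
    vision_feat: List[float]          # e.g. pooled visual features of the region
    text_feat: List[float]            # e.g. an embedding of the recognized text
    box: Tuple[int, int, int, int]    # (x0, y0, x1, y1) in pixels


def normalize_box(box, frame_w, frame_h):
    """Scale pixel coordinates to [0, 1] so they do not depend on resolution."""
    x0, y0, x1, y1 = box
    return [x0 / frame_w, y0 / frame_h, x1 / frame_w, y1 / frame_h]


def fuse(tb, frame_w, frame_h):
    """Concatenate the three modalities into one feature vector.

    Concatenation is an assumed, simplest-possible fusion; a real model
    could use learned projections or attention instead.
    """
    return tb.vision_feat + tb.text_feat + normalize_box(tb.box, frame_w, frame_h)


# Toy values chosen for illustration, not real model outputs.
example = TextBox(vision_feat=[0.5], text_feat=[0.1, 0.2], box=(64, 36, 128, 72))
fused = fuse(example, frame_w=1280, frame_h=720)
```

Here fused holds one vision value, two text values, and four normalized coordinates, so a downstream classifier would see all three modalities in a single vector.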