HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans Do
This work addresses a foundational problem in video understanding, potentially benefiting streaming services and broadcasters by enabling more intuitive video interaction, though it is incremental as it extends text-domain research to video.
The authors introduced the HLVU challenge and dataset to test if computer systems can understand movies by recognizing relationships among actors and non-visual concepts, aiming to benchmark this against human-like comprehension.
In this paper we propose a new evaluation challenge and direction in the area of High-Level Video Understanding (HLVU). The challenge is designed to test automatic video analysis and understanding, specifically how accurately systems can comprehend a movie in terms of its actors, entities, events, and their relationships to one another. A pilot HLVU dataset of open source movies was collected so that human assessors can build a knowledge graph representing each movie. A set of queries will be derived from these knowledge graphs to test systems on retrieving relationships among actors, as well as on reasoning about and retrieving non-visual concepts. The objective is to benchmark whether a computer system can "understand" non-explicit but obvious relationships the same way humans do when they watch the same movies. This is a long-standing problem that is being addressed in the text domain, and this project moves similar research to the video domain. Work of this nature is foundational to future video analytics and video understanding technologies. It can be of interest to streaming services and broadcasters hoping to provide more intuitive ways for their customers to interact with and consume video content.
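To make the idea of deriving queries from a movie knowledge graph concrete, the following is a minimal sketch in Python. It is not the HLVU annotation format or query language. The class `MovieKnowledgeGraph`, its methods, the relation labels, and the characters "Alice", "Bob", and "Carol" are all hypothetical and chosen only for illustration. The sketch stores the graph as directed (subject, relation, object) triples and shows two query types: looking up the direct relationship between two entities, and finding a chain of relationships that links them, which stands in for a relationship that is never stated explicitly.

```python
from collections import defaultdict, deque


class MovieKnowledgeGraph:
    """Toy directed graph of (subject, relation, object) triples (hypothetical)."""

    def __init__(self):
        # Maps each subject to a list of (relation, object) pairs.
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self._edges[subject].append((relation, obj))

    def relations_between(self, subject, obj):
        """Return the labels of all direct edges from subject to obj."""
        return [rel for rel, target in self._edges.get(subject, []) if target == obj]

    def find_path(self, start, goal):
        """Breadth-first search for a shortest chain of triples from start to goal.

        Returns a list of (subject, relation, object) triples, or None if
        goal is not reachable from start along directed edges.
        """
        queue = deque([(start, [])])
        visited = {start}
        while queue:
            node, path = queue.popleft()
            if node == goal:
                return path
            for rel, target in self._edges.get(node, []):
                if target not in visited:
                    visited.add(target)
                    queue.append((target, path + [(node, rel, target)]))
        return None


# Hypothetical annotations for an imaginary movie.
kg = MovieKnowledgeGraph()
kg.add("Alice", "sibling_of", "Bob")
kg.add("Bob", "works_at", "Bakery")
kg.add("Carol", "owns", "Bakery")

# Direct relationship query: how is Alice related to Bob?
direct = kg.relations_between("Alice", "Bob")

# Multi-hop query: how is Alice connected to the Bakery?
chain = kg.find_path("Alice", "Bakery")
```

Here `direct` holds the single label `sibling_of`, and `chain` links Alice to the Bakery through Bob. In an evaluation setting, a system's answer to such a query would be compared against the path recorded by human assessors. Because edges are directed in this sketch, a query from "Bakery" back to "Alice" finds no path, so a real annotation scheme would need to decide whether and how to represent inverse relations.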