Specification-Driven Video Search via Foundation Models and Formal Verification
This addresses the need for efficient and privacy-preserving video search, offering a novel approach that avoids manual inspection or massive training data, though it is incremental in combining existing techniques.
The paper tackles the problem of automatically searching for events of interest in video clips, such as emergency incidents, by developing a method that uses vision and language models and formal verification, achieving over 90% precision on privacy-sensitive videos and an autonomous driving dataset.
The increasing abundance of video data enables users to search for events of interest, e.g., emergency incidents. Meanwhile, it raises new concerns, such as the need for preserving privacy. Existing approaches to video search require either manual inspection or a deep learning model with massive training. We develop a method that uses recent advances in vision and language models, as well as formal methods, to search for events of interest in video clips automatically and efficiently. The method consists of an algorithm to map text-based event descriptions into linear temporal logic over finite traces (LTL$_f$) and an algorithm to construct an automaton encoding the video information. Then, the method formally verifies the automaton representing the video against the LTL$_f$ specifications and adds the pertinent video clips to the search result if the automaton satisfies the specifications. We provide qualitative and quantitative analysis to demonstrate the video-searching capability of the proposed method. It achieves over 90 percent precision in searching over privacy-sensitive videos and a state-of-the-art autonomous driving dataset.