Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
The paper addresses the long-standing challenge of explainable video description for applications that require interpretable AI systems, though it appears to be an incremental hybrid approach that combines existing vision and language models with a new graph-based reasoning step.
The paper tackles the problem of understanding relationships between vision and language by proposing a graph-based reasoning approach over events in space and time to generate video descriptions. It validates the method by showing it produces coherent descriptions on various datasets using both traditional metrics and modern LLM evaluation.
In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, and action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time, built in an explainable and programmatic way, to connect learning-based state-of-the-art vision and language models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g., BLEU, ROUGE) and the modern LLM-as-a-Jury approach.