LLM-powered Query Expansion for Enhancing Boundary Prediction in Language-driven Action Localization
This work addresses the challenge of precise action boundary prediction in video analysis. The contribution is incremental: it builds on existing localization models and adds a model-agnostic enhancement.
The paper tackles boundary uncertainty in language-driven action localization in two ways: it uses LLMs to expand queries with descriptions of the action start and end boundaries, and it models boundary probabilities during training. Experiments on several datasets show improved localization performance.
Language-driven action localization in videos requires not only semantic alignment between the language query and the video segment, but also prediction of the action boundaries. However, the language query primarily describes the main content of an action and usually lacks specific details about where the action starts and ends, which makes manual boundary annotation more subjective and leads to boundary uncertainty in the training data. In this paper, we address this problem in two ways. First, we expand the original query by using LLMs to generate textual descriptions of the action start and end boundaries, which provide more detailed boundary cues for localization and thus reduce the impact of boundary uncertainty. Second, to increase tolerance to boundary uncertainty during training, we model probability scores of action boundaries by computing the semantic similarities between frames and the expanded query, together with the temporal distances between frames and the annotated boundary frames. These probability scores provide more consistent boundary supervision and thus improve the stability of training. Our method is model-agnostic and can be integrated off the shelf into existing models for language-driven action localization. Experimental results on several datasets demonstrate the effectiveness of our method.
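
The two ideas described above can be illustrated with a minimal Python sketch. The abstract does not give the prompt wording or the exact scoring formula, so everything here is an assumption made for illustration: the function names `build_expansion_prompt` and `boundary_probabilities`, the prompt text, the softmax over cosine similarities, the Gaussian weighting of temporal distance, and the parameters `sigma` and `temperature` are all hypothetical and should not be read as the authors' implementation.

```python
import math
from typing import List, Sequence


def build_expansion_prompt(query: str) -> str:
    """Hypothetical prompt asking an LLM to describe the start and end of an action."""
    return (
        f'Action query: "{query}"\n'
        "In one sentence, describe what is visible at the moment this action starts. "
        "In one sentence, describe what is visible at the moment it ends."
    )


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def boundary_probabilities(
    frame_feats: Sequence[Sequence[float]],
    boundary_text_feat: Sequence[float],
    annotated_idx: int,
    sigma: float = 2.0,
    temperature: float = 0.1,
) -> List[float]:
    """Soft boundary scores over frames (assumed combination, not the paper's formula).

    frame_feats: one embedding per frame.
    boundary_text_feat: embedding of the generated start (or end) description.
    annotated_idx: index of the annotated boundary frame.
    """
    # Semantic term: softmax over frame/description cosine similarities.
    sims = [cosine(f, boundary_text_feat) / temperature for f in frame_feats]
    max_sim = max(sims)
    semantic = [math.exp(s - max_sim) for s in sims]

    # Temporal term: Gaussian weight that decays with distance to the annotation.
    temporal = [
        math.exp(-0.5 * ((t - annotated_idx) / sigma) ** 2)
        for t in range(len(frame_feats))
    ]

    # Combine both terms and normalize so the scores sum to 1.
    scores = [s * w for s, w in zip(semantic, temporal)]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    print(build_expansion_prompt("person opens the door"))
    # Toy example: frame 2 matches the boundary description, annotation says frame 3.
    frames = [[1, 0], [1, 0], [0, 1], [1, 0], [1, 0], [1, 0]]
    probs = boundary_probabilities(frames, [0, 1], annotated_idx=3)
    print([round(p, 4) for p in probs])
```

In this toy setup, a frame next to the annotated boundary that matches the generated description receives most of the probability mass, which is the kind of tolerance to annotation noise the abstract describes. In practice the embeddings would come from the video and text encoders of whichever localization model the method is attached to.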