Transparent and Coherent Procedural Mistake Detection
This work aims to make PMD more transparent and coherent for applications that monitor human task execution. Its contribution is incremental, since it builds on existing VLMs and metrics.
The paper extends procedural mistake detection (PMD) to require generating visual self-dialog rationales, which makes decisions transparent. It shows that vision-and-language models (VLMs) struggle off-the-shelf, but that incorporating automated coherence metrics into inference and fine-tuning methods can improve their accuracy, coherence, and efficiency, with some trade-offs.
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
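To make the reformulated task concrete, the following is a minimal sketch of a visual self-dialog loop: the system repeatedly asks a question about a single frame, answers it, and updates a mistake estimate, so the accumulated question-answer pairs serve as the rationale. It is an illustration under stated assumptions, not the authors' implementation. The names `self_dialog_pmd`, `Decision`, `ask`, `answer`, and `judge`, the early-stopping rule, and the toy callbacks standing in for a VLM are all hypothetical, and the NLI-based coherence metrics are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # one (question, answer) pair of the self-dialog


@dataclass
class Decision:
    mistake: bool                 # final PMD label
    mistake_prob: float           # estimated probability of a mistake
    rationale: List[Turn] = field(default_factory=list)


def self_dialog_pmd(
    procedure: str,
    frame: Dict[str, bool],
    ask: Callable[[str, List[Turn]], str],
    answer: Callable[[Dict[str, bool], str], str],
    judge: Callable[[str, List[Turn]], float],
    max_turns: int = 3,
    stop_margin: float = 0.35,   # assumption: stop once the estimate is this far from 0.5
) -> Decision:
    """Hypothetical loop: question, answer, re-estimate, stop early when confident."""
    rationale: List[Turn] = []
    prob = 0.5  # no evidence yet
    for _ in range(max_turns):
        question = ask(procedure, rationale)
        rationale.append((question, answer(frame, question)))
        prob = judge(procedure, rationale)
        if abs(prob - 0.5) >= stop_margin:
            break
    return Decision(mistake=prob >= 0.5, mistake_prob=prob, rationale=rationale)


# Toy stand-ins for a VLM (assumptions for illustration only).
QUESTIONS = ["Is there a mug in the image?", "Is there water in the mug?"]


def toy_ask(procedure: str, rationale: List[Turn]) -> str:
    # Ask the next scripted question; repeat the last one if the script runs out.
    return QUESTIONS[min(len(rationale), len(QUESTIONS) - 1)]


def toy_answer(frame: Dict[str, bool], question: str) -> str:
    # The "frame" is a dictionary of facts rather than pixels.
    return "yes" if frame.get(question, False) else "no"


def toy_judge(procedure: str, rationale: List[Turn]) -> float:
    # Any "no" signals a likely mistake; each "yes" lowers the estimate.
    if any(a == "no" for _, a in rationale):
        return 0.95
    return 0.5 - 0.2 * len(rationale)


frame = {QUESTIONS[0]: True, QUESTIONS[1]: False}
decision = self_dialog_pmd("Pour water into the mug", frame, toy_ask, toy_answer, toy_judge)
print(decision.mistake, decision.rationale)  # mistake flagged after two turns
```

Returning the rationale alongside the label is what would let a separate metric score each question-answer pair for coherence, and the early exit illustrates one way that efficiency could trade off against evidence gathered.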