CV · May 28, 2025

IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth

Md Touhidul Islam, Imran Kabir, Md Alimoor Reza, Syed Masum Billah

arXiv:2505.22305v1 · 3.6 · h-index: 11 · Conference on Designing Interactive Systems

Originality: Incremental advance

AI Analysis

This work addresses a challenge faced by researchers and practitioners: evaluating vision-language models when ground truth is unavailable. It is an incremental advance, since it builds on existing visualization and adversarial techniques.

The paper tackles the evaluation of vision-language models for video object recognition without ground truth. It introduces IKIWISI, an interactive visual pattern generator that uses heatmaps and spy objects to assess model reliability. In a user study, participants found the tool easy to use and made assessments that correlated with objective metrics.

We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to evaluate model reliability. IKIWISI introduces "spy objects": adversarial instances users know are absent, to discern models hallucinating on nonexistent items. The tool functions as a cognitive audit mechanism, surfacing mismatches between human and machine perception by visualizing where models diverge from human understanding. Our study with 15 participants found that users considered IKIWISI easy to use, made assessments that correlated with objective metrics when available, and reached informed conclusions by examining only a small fraction of heatmap cells. This approach not only complements traditional evaluation methods through visual assessment of model behavior with custom object sets, but also reveals opportunities for improving alignment between human perception and machine understanding in vision-language systems.
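The heatmap and spy-object ideas described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`build_heatmap`, `spy_hallucination_rate`, `render`), the per-frame set-of-labels input format, and the toy data are all assumptions made for illustration. Rows are objects, columns are video frames, and a cell is 1 ("green") when the model reports the object as present and 0 ("red") otherwise. Any green cell in a spy-object row is a hallucination by construction, because the user knows the object is absent.

```python
from typing import List, Set


def build_heatmap(detections: List[Set[str]], objects: List[str]) -> List[List[int]]:
    """Return a binary grid: one row per object, one column per frame.

    detections[i] is the (hypothetical) set of labels the model reported
    for frame i. A cell is 1 if the object was reported present, else 0.
    """
    return [[1 if obj in frame else 0 for frame in detections] for obj in objects]


def spy_hallucination_rate(
    heatmap: List[List[int]], objects: List[str], spy_objects: Set[str]
) -> float:
    """Fraction of spy-object cells marked present (each one is a hallucination)."""
    cells = [
        cell
        for obj, row in zip(objects, heatmap)
        if obj in spy_objects
        for cell in row
    ]
    return sum(cells) / len(cells) if cells else 0.0


def render(heatmap: List[List[int]], objects: List[str]) -> str:
    """Text stand-in for the colored heatmap: G = present, R = absent."""
    return "\n".join(
        f"{obj:>10} " + "".join("G" if cell else "R" for cell in row)
        for obj, row in zip(objects, heatmap)
    )


# Toy data (invented for this sketch): model output for four frames.
detections = [{"car", "person"}, {"car"}, {"car", "giraffe"}, {"person"}]
objects = ["car", "person", "giraffe"]
spy_objects = {"giraffe"}  # the user knows no giraffe appears in the video

heatmap = build_heatmap(detections, objects)
rate = spy_hallucination_rate(heatmap, objects, spy_objects)
print(render(heatmap, objects))
print(f"spy hallucination rate: {rate:.2f}")
```

In this toy run the giraffe row contains one green cell out of four, so a quarter of the spy cells are hallucinated. In the tool itself the judgment is made visually by the user rather than by a single number; the rate here is only a convenient summary for the sketch.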
