CV Mar 3, 2025
ClipGrader: Leveraging Vision-Language Models for Robust Label Quality Assessment in Object Detection
Hong Lu, Yali Bian, Rahul C. Shah
High-quality annotations are essential for object detection models, but ensuring label accuracy, especially for bounding boxes, remains both challenging and costly. This paper introduces ClipGrader, a novel approach that leverages vision-language models to automatically assess the accuracy of bounding box annotations. By adapting CLIP (Contrastive Language-Image Pre-training) to evaluate both the correctness of class labels and the spatial precision of bounding boxes, ClipGrader offers an effective solution for grading object detection labels. Tested on modified object detection datasets with artificially disturbed bounding boxes, ClipGrader achieves 91% accuracy on COCO with a 1.8% false positive rate. Moreover, it maintains 87% accuracy with a 2.1% false positive rate when trained on just 10% of the COCO data. ClipGrader also scales effectively to larger datasets such as LVIS, achieving 79% accuracy across 1,203 classes. Our experiments demonstrate ClipGrader's ability to identify errors in existing COCO annotations, highlighting its potential for dataset refinement. When integrated into a semi-supervised object detection (SSOD) model, ClipGrader readily improves pseudo-label quality, helping achieve higher mAP (mean Average Precision) throughout the training process. ClipGrader thus provides a scalable AI-assisted tool for enhancing annotation quality control and verifying annotations in large-scale object detection datasets.
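To make the idea of "artificially disturbed bounding boxes" concrete, here is a minimal sketch of how a clean box might be perturbed and then graded against the original by intersection over union (IoU). The noise model (`disturb_box`), the `max_shift` parameter, and the 0.5 IoU threshold are illustrative assumptions, not the procedure or settings reported in the paper.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def disturb_box(box, max_shift, rng):
    """Hypothetical noise model: move each edge by up to max_shift of the box size."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    nx1 = x1 + rng.uniform(-max_shift, max_shift) * w
    nx2 = x2 + rng.uniform(-max_shift, max_shift) * w
    ny1 = y1 + rng.uniform(-max_shift, max_shift) * h
    ny2 = y2 + rng.uniform(-max_shift, max_shift) * h
    # Sort so the result stays a valid box even if the edges cross.
    nx1, nx2 = sorted((nx1, nx2))
    ny1, ny2 = sorted((ny1, ny2))
    return (nx1, ny1, nx2, ny2)

def make_graded_example(box, rng, max_shift=0.3, iou_threshold=0.5):
    """Return a disturbed box, its IoU with the original, and a good/bad grade."""
    noisy = disturb_box(box, max_shift, rng)
    quality = iou(box, noisy)
    grade = "good" if quality >= iou_threshold else "bad"
    return noisy, quality, grade

if __name__ == "__main__":
    rng = random.Random(0)
    print(make_graded_example((10.0, 10.0, 110.0, 60.0), rng))
```

Pairs of disturbed boxes and grades like these could serve as training or evaluation targets for a label grader; the actual grading criteria would follow the paper.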
LG May 26, 2023
DeepSI: Interactive Deep Learning for Semantic Interaction
Yali Bian, Chris North
In this paper, we design novel interactive deep learning methods to improve semantic interactions in visual analytics applications. The ability of semantic interaction to infer analysts' precise intents during sensemaking is dependent on the quality of the underlying data representation. We propose the $\text{DeepSI}_{\text{finetune}}$ framework that integrates deep learning into the human-in-the-loop interactive sensemaking pipeline, with two important properties. First, deep learning extracts meaningful representations from raw data, which improves semantic interaction inference. Second, semantic interactions are exploited to fine-tune the deep learning representations, which then further improves semantic interaction inference. This feedback loop between human interaction and deep learning enables efficient learning of user- and task-specific representations. To evaluate the advantage of embedding the deep learning within the semantic interaction loop, we compare $\text{DeepSI}_{\text{finetune}}$ against a state-of-the-art but more basic use of deep learning as only a feature extractor pre-processed outside of the interactive loop. Results of two complementary studies, a human-centered qualitative case study and an algorithm-centered simulation-based quantitative experiment, show that $\text{DeepSI}_{\text{finetune}}$ more accurately captures users' complex mental models with fewer interactions.
HC Jul 31, 2020
Evaluating Semantic Interaction on Word Embeddings via Simulation
Yali Bian, Michelle Dowling, Chris North
Semantic interaction (SI) attempts to learn the user's cognitive intents as they directly manipulate data projections during sensemaking activity. For text analysis, prior implementations of SI have used common data features, such as bag-of-words representations, for machine learning from user interactions. Instead, we hypothesize that features derived from deep learning word embeddings will enable SI to better capture the user's subtle intents. However, evaluating these effects is difficult. SI systems are usually evaluated by a human-centered qualitative approach, by observing the utility and effectiveness of the application for end-users. This approach has drawbacks in terms of replicability, scalability, and objectivity, which make it hard to perform convincing contrast experiments between different SI models. To tackle this problem, we explore a quantitative algorithm-centered analysis as a complementary evaluation approach, by simulating users' interactions and calculating the accuracy of the learned model. We use these methods to compare word embeddings to bag-of-words features for SI.
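The simulation-based evaluation described above can be illustrated with a toy sketch: a simulated user groups items by their ground-truth labels, a stand-in model learns feature weights from that grouping, and accuracy is measured by nearest-neighbor classification on held-out items. Everything here is an assumption made for illustration, including the inverse-variance weighting rule and the function names `learn_weights` and `nn_accuracy`; the paper's SI model and simulation protocol are not reproduced.

```python
def learn_weights(points, labels, eps=1e-6):
    """Stand-in for an SI model: weight each feature by the inverse of its
    mean within-group variance, normalised so the weights sum to 1."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    raw = []
    for d in range(len(points[0])):
        var_sum = 0.0
        for members in groups.values():
            mean = sum(m[d] for m in members) / len(members)
            var_sum += sum((m[d] - mean) ** 2 for m in members) / len(members)
        raw.append(1.0 / (var_sum / len(groups) + eps))
    total = sum(raw)
    return [r / total for r in raw]

def weighted_dist(a, b, w):
    """Weighted Euclidean distance between two feature vectors."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)) ** 0.5

def nn_accuracy(train_pts, train_lbls, test_pts, test_lbls, w):
    """Fraction of test items whose nearest training item shares their label."""
    correct = 0
    for p, y in zip(test_pts, test_lbls):
        nearest = min(range(len(train_pts)),
                      key=lambda i: weighted_dist(p, train_pts[i], w))
        correct += train_lbls[nearest] == y
    return correct / len(test_pts)

if __name__ == "__main__":
    # Feature 0 separates the two groups; feature 1 is noise.
    train = [(0.0, 0.0), (0.1, 5.0), (1.0, 0.2), (0.9, 5.1)]
    labels = ["a", "a", "b", "b"]  # the simulated user's grouping
    test, test_labels = [(0.05, 2.5)], ["a"]
    learned = learn_weights(train, labels)
    print("uniform weights:", nn_accuracy(train, labels, test, test_labels, [0.5, 0.5]))
    print("learned weights:", nn_accuracy(train, labels, test, test_labels, learned))
```

Running the same simulated interactions over two feature sets (for example bag-of-words and word embeddings) and comparing the resulting accuracies is the kind of contrast experiment the abstract refers to.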
HC Jul 31, 2020
DeepVA: Bridging Cognition and Computation through Semantic Interaction and Deep Learning
Yali Bian, John Wenskovitch, Chris North
This paper examines how deep learning (DL) representations, in contrast to traditional engineered features, can support semantic interaction (SI) in visual analytics. SI attempts to model users' cognitive reasoning via their interaction with data items, based on the data features. We hypothesize that DL representations contain meaningful high-level abstractions that can better capture users' high-level cognitive intent. To bridge the gap between cognition and computation in visual analytics, we propose DeepVA (Deep Visual Analytics), which uses high-level deep learning representations for semantic interaction instead of low-level hand-crafted data features. To evaluate DeepVA and compare it to SI models with lower-level features, we design and implement a system that extends a traditional SI pipeline with features at three different levels of abstraction. To test the relationship between task abstraction and feature abstraction in SI, we perform visual concept learning tasks at three different task abstraction levels, using semantic interaction with three different feature abstraction levels. DeepVA effectively hastened interactive convergence between cognitive understanding and computational modeling of the data, especially in high abstraction tasks.