Paulina Varshavskaya

h-index: 9

Papers: 3

Citations: 3

Novelty: 42%

AI Score: 32

Ranked #126,299 of 194,257 authors (top 65%) · #27,807 in LG (top 69%)

3 Papers

2.6 · LG · Sep 30, 2024

Fine-tuning Vision Classifiers On A Budget

Sunil Kumar, Ted Sandler, Paulina Varshavskaya · amazon-science

Fine-tuning modern computer vision models requires accurately labeled data for which the ground truth may not exist, but a set of multiple labels can be obtained from labelers of variable accuracy. We tie the notion of label quality to confidence in labeler accuracy and show that, when prior estimates of labeler accuracy are available, using a simple naive-Bayes model to estimate the true labels allows us to label more data on a fixed budget without compromising label or fine-tuning quality. We present experiments on a dataset of industrial images that demonstrate that our method, called Ground Truth Extension (GTX), enables fine-tuning ML models using fewer human labels.
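The naive-Bayes idea in this abstract can be illustrated with a minimal sketch. This is not the authors' GTX implementation: the function names, the symmetric error model (a labeler's mistakes are spread evenly over the wrong classes), and the confidence threshold are all assumptions made for illustration.

```python
import math


def posterior_true_label(votes, accuracies, num_classes, prior=None):
    """Posterior over the true class, given votes from independent labelers.

    Assumed noise model (hypothetical, not taken from the paper): a labeler
    with accuracy `a` reports the true class with probability `a` and each
    of the other classes with probability (1 - a) / (num_classes - 1).
    """
    if prior is None:
        prior = [1.0 / num_classes] * num_classes  # uniform class prior
    log_post = [math.log(p) for p in prior]
    for vote, acc in zip(votes, accuracies):
        acc = min(max(acc, 1e-6), 1.0 - 1e-6)  # keep log() finite
        wrong = (1.0 - acc) / (num_classes - 1)
        for c in range(num_classes):
            log_post[c] += math.log(acc if vote == c else wrong)
    # Normalize in a numerically stable way.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def needs_more_labels(posterior, threshold=0.95):
    """Hypothetical stopping rule: request another label only while the
    most likely class is still below the confidence threshold."""
    return max(posterior) < threshold


# Three labelers with prior accuracy estimates 0.9, 0.8 and 0.6 vote 1, 1, 0.
post = posterior_true_label([1, 1, 0], [0.9, 0.8, 0.6], num_classes=2)
print(post, needs_more_labels(post))
```

Under these assumptions, an item stops consuming labeling budget as soon as its posterior is confident enough, which is one plausible reading of how labeler-accuracy priors could stretch a fixed budget.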

3.7 · CV · Sep 25, 2024 · Code

Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Bowen Zhao, Leo Parker Dirac, Paulina Varshavskaya

Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.

11.4 · LG · Jun 10, 2025

Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints

Sunil Kumar, Bowen Zhao, Leo Dirac et al.

Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like DeepSeek-R1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
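For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value function. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the example rewards are assumptions for illustration, and nothing here reproduces the paper's reward structure or training setup.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation; implementations differ on
    this detail and on the value of eps.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled answers with a simple 0/1 reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a data mix with harder examples can matter for this kind of training.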