Maan Qraitem

CV
h-index25
9papers
125citations
Novelty53%
AI Score47

9 Papers

CVSep 30, 2022Code
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

Maan Qraitem, Kate Saenko, Bryan A. Plummer

Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups $B$ (\eg Female) within class labels $Y$ (\eg Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and bias groups such as age, gender, or race. Most recent methods that address this problem require significant architectural changes or additional loss functions requiring more hyper-parameter tuning. Alternatively, data sampling baselines from the class imbalance literature (\eg Undersampling, Upweighting), which can often be implemented in a single line of code and often have no hyperparameters, offer a cheaper and more efficient solution. However, these methods suffer from significant shortcomings. For example, Undersampling drops a significant part of the input distribution per epoch while Oversampling repeats samples, causing overfitting. To address these shortcomings, we introduce a new class-conditioned sampling method: Bias Mimicking. The method is based on the observation that if a class $c$ bias distribution, \ie $P_D(B|Y=c)$ is mimicked across every $c^{\prime}\neq c$, then $Y$ and $B$ are statistically independent. Using this notion, BM, through a novel training procedure, ensures that the model is exposed to the entire distribution per epoch without repeating samples. Consequently, Bias Mimicking improves underrepresented groups' accuracy of sampling methods by 3\% over four benchmarks while maintaining and sometimes improving performance over nonsampling methods. Code: \url{https://github.com/mqraitem/Bias-Mimicking}

CVAug 8, 2023Code
From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Maan Qraitem, Kate Saenko, Bryan A. Plummer

Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions $B$ (\eg, Indoors) are over-represented in certain classes $Y$ (\eg, Big Dogs). Synthetic data from off-the-shelf large-scale generative models offers a promising direction to mitigate this issue by augmenting underrepresented subgroups in the real dataset. However, by using a mixed distribution of real and synthetic data, we introduce another source of bias due to distributional differences between synthetic and real data (\eg synthetic artifacts). As we will show, prior work's approach for using synthetic data to resolve the model's bias toward $B$ do not correct the model's bias toward the pair $(B, G)$, where $G$ denotes whether the sample is real or synthetic. Thus, the model could simply learn signals based on the pair $(B, G)$ (\eg, Synthetic Indoors) to make predictions about $Y$ (\eg, Big Dogs). To address this issue, we propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real (FFR). The first step of FFR pre-trains a model on balanced synthetic data to learn robust representations across subgroups. In the second step, FFR fine-tunes the model on real data using ERM or common loss-based bias mitigation methods. By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data and thus avoids the issue of bias toward the pair $(B, G)$. Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20\% over three datasets. Code available: \url{https://github.com/mqraitem/From-Fake-to-Real}

CVJul 25, 2024
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

Eunice Yiu, Maan Qraitem, Anisa Noor Majhi et al. · berkeley

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with quantifying the "how" and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text.

85.0CLApr 22Code
Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation

Maan Qraitem, Kate Saenko, Bryan A. Plummer

Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction-following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation. Code: https://github.com/mqraitem/Persona-Weaver

CVFeb 1, 2024Code
Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks

Maan Qraitem, Nazia Tasnim, Piotr Teterwak et al.

Typographic attacks, adding misleading text to images, can deceive vision-language models (LVLMs). The susceptibility of recent large LVLMs like GPT4-V to such attacks is understudied, raising concerns about amplified misinformation in personal assistant applications. Previous attacks use simple strategies, such as random misleading words, which don't fully exploit LVLMs' language reasoning abilities. We introduce an experimental setup for testing typographic attacks on LVLMs and propose two novel self-generated attacks: (1) Class-based attacks, where the model identifies a similar class to deceive itself, and (2) Reasoned attacks, where an advanced LVLM suggests an attack combining a deceiving class and description. Our experiments show these attacks significantly reduce classification performance by up to 60\% and are effective across different models, including InstructBLIP and MiniGPT4. Code: https://github.com/mqraitem/Self-Gen-Typo-Attack

CVMar 17, 2025Code
Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem, Piotr Teterwak, Kate Saenko et al.

Vision-language models (VLMs) (e.g. CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a "typographic" attack. These attacks succeed due to VLMs' text-heavy bias-a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce "artifact-based" attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them simultaneously harder to defend against and more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness. Code: https://github.com/mqraitem/Web-Artifact-Attacks

CVJun 3, 2024
SLANT: Spurious Logo ANalysis Toolkit

Maan Qraitem, Piotr Teterwak, Kate Saenko et al.

Online content is filled with logos, from ads and social media posts to website branding and product placements. Consequently, these logos are prevalent in the extensive web-scraped datasets used to pretrain Vision-Language Models, which are used for a wide array of tasks (content moderation, object classification). While these models have been shown to learn harmful correlations in various tasks, whether these correlations include logos remains understudied. Understanding this is especially important due to logos often being used by public-facing entities like brands and government agencies. To that end, we develop SLANT: A Spurious Logo ANalysis Toolkit. Our key finding is that some logos indeed lead to spurious incorrect predictions, for example, adding the Adidas logo to a photo of a person causes a model classify the person as greedy. SLANT contains a semi-automatic mechanism for mining such "spurious" logos. The mechanism consists of a comprehensive logo bank, CC12M-LogoBank, and an algorithm that searches the bank for logos that VLMs spuriously correlate with a user-provided downstream recognition target. We uncover various seemingly harmless logos that VL models correlate 1) with negative human adjectives 2) with the concept of `harmlessness'; causing models to misclassify harmful online content as harmless, and 3) with user-provided object concepts; causing lower recognition accuracy on ImageNet zero-shot classification. Furthermore, SLANT's logos can be seen as effective attacks against foundational models; an attacker could place a spurious logo on harmful content, causing the model to misclassify it as harmless. This threat is alarming considering the simplicity of logo attacks, increasing the attack surface of VL models. As a defense, we include in our Toolkit two effective mitigation strategies that seamlessly integrate with zero-shot inference of foundation models.

CVDec 6, 2021
From Coarse to Fine-grained Concept based Discrimination for Phrase Detection

Maan Qraitem, Bryan A. Plummer

Phrase detection requires methods to identify if a phrase is relevant to an image and localize it, if applicable. A key challenge for training more discriminative detection models is sampling negatives. Sampling techniques from prior work focus primarily on hard, often noisy, negatives disregarding the broader distribution of negative samples. Our proposed CFCD-Net addresses this through two novels methods. First, we generate groups of semantically similar words we call concepts (\eg, \{dog, cat, horse\} and \ \{car, truck, SUV\}), and then train our CFCD-Net to discriminate between a region of interest and its unrelated concepts. Second, for phrases containing fine-grained mutually-exclusive words (\eg, colors), we force the model to select only one applicable phrase for each region using our novel fine-grained module (FGM). We evaluate our approach on Flickr30K Entities and RefCOCO+, where we improve mAP over the state-of-the-art by 1.5-2 points. When considering only the phrases affected by our FGM module, we improve by 3-4 points on both datasets.

LGAug 23, 2020
Bridging the Gap: Machine Learning to Resolve Improperly Modeled Dynamics

Maan Qraitem, Dhanushka Kularatne, Eric Forgoston et al.

We present a data-driven modeling strategy to overcome improperly modeled dynamics for systems exhibiting complex spatio-temporal behaviors. We propose a Deep Learning framework to resolve the differences between the true dynamics of the system and the dynamics given by a model of the system that is either inaccurately or inadequately described. Our machine learning strategy leverages data generated from the improper system model and observational data from the actual system to create a neural network to model the dynamics of the actual system. We evaluate the proposed framework using numerical solutions obtained from three increasingly complex dynamical systems. Our results show that our system is capable of learning a data-driven model that provides accurate estimates of the system states both in previously unobserved regions as well as for future states. Our results show the power of state-of-the-art machine learning frameworks in estimating an accurate prior of the system's true dynamics that can be used for prediction up to a finite horizon.