Aparna Bharati

CV
h-index63
18papers
1,331citations
Novelty44%
AI Score44

18 Papers

CVNov 13, 2025
Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Mayank Vatsa, Aparna Bharati, Richa Singh

The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives-negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.

CVSep 27, 2023
Subjective Face Transform using Human First Impressions

Chaitanya Roygaga, Joshua Krinsky, Kai Zhang et al.

Humans tend to form quick subjective first impressions of non-physical attributes when seeing someone's face, such as perceived trustworthiness or attractiveness. To understand what variations in a face lead to different subjective impressions, this work uses generative models to find semantically meaningful edits to a face image that change perceived attributes. Unlike prior work that relied on statistical manipulation in feature space, our end-to-end framework considers trade-offs between preserving identity and changing perceptual attributes. It maps latent space directions to changes in attribute scores, enabling a perceptually significant identity-preserving transformation of any input face along an attribute axis according to a target change. We train on real and synthetic faces, evaluate for in-domain and out-of-domain images using predictive models and human ratings, demonstrating the generalizability of our approach. Ultimately, such a framework can be used to understand and explain trends and biases in subjective interpretation of faces that are not dependent on the subject's identity. This is demonstrated with improved model performance for first impression prediction when augmenting the training data with images generated by the proposed approach for a wider range of input to learn associations between face features and subjective attributes.

CVDec 9, 2022
A Computer Vision Method for Estimating Velocity from Jumps

Soumyadip Roy, Chaitanya Roygaga, Nathaniel Blanchard et al.

Athletes routinely undergo fitness evaluations to evaluate their training progress. Typically, these evaluations require a trained professional who utilizes specialized equipment like force plates. For the assessment, athletes perform drop and squat jumps, and key variables are measured, e.g. velocity, flight time, and time to stabilization, to name a few. However, amateur athletes may not have access to professionals or equipment that can provide these assessments. Here, we investigate the feasibility of estimating key variables using video recordings. We focus on jump velocity as a starting point because it is highly correlated with other key variables and is important for determining posture and lower-limb capacity. We find that velocity can be estimated with a high degree of precision across a range of athletes, with an average R-value of 0.71 (SD = 0.06).

CVMar 29, 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa et al.

Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. An important aspect of reasoning in logic and language is negations. This paper highlights the limitations of popular VLMs such as CLIP, at understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework, has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of objects, relations, and attributes in text. Overall, our work addresses a crucial limitation of VLMs by introducing a dataset and framework that strengthens semantic associations between images and text, demonstrating improved large-scale foundation models with significantly reduced computational cost, promoting efficiency and accessibility.

CVFeb 12, 2024
Exploring Saliency Bias in Manipulation Detection

Joshua Krinsky, Alan Bettis, Qiuyu Tang et al.

The social media-fuelled explosion of fake news and misinformation supported by tampered images has led to growth in the development of models and datasets for image manipulation detection. However, existing detection methods mostly treat media objects in isolation, without considering the impact of specific manipulations on viewer perception. Forensic datasets are usually analyzed based on the manipulation operations and corresponding pixel-based masks, but not on the semantics of the manipulation, i.e., type of scene, objects, and viewers' attention to scene content. The semantics of the manipulation play an important role in spreading misinformation through manipulated images. In an attempt to encourage further development of semantic-aware forensic approaches to understand visual misinformation, we propose a framework to analyze the trends of visual and semantic saliency in popular image manipulation datasets and their impact on detection.

CVJun 4, 2025
Is Perturbation-Based Image Protection Disruptive to Image Editing?

Qiuyu Tang, Bonor Ayambem, Mooi Choo Chuah et al.

The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

CVSep 17, 2025
StyleProtect: Safeguarding Artistic Identity in Fine-tuned Diffusion Models

Qiuyu Tang, Joshua Krinsky, Aparna Bharati

The rapid advancement of generative models, particularly diffusion-based approaches, has inadvertently facilitated their potential for misuse. Such models enable malicious exploiters to replicate artistic styles that capture an artist's creative labor, personal vision, and years of dedication in an inexpensive manner. This has led to a rise in the need and exploration of methods for protecting artworks against style mimicry. Although generic diffusion models can easily mimic an artistic style, finetuning amplifies this capability, enabling the model to internalize and reproduce the style with higher fidelity and control. We hypothesize that certain cross-attention layers exhibit heightened sensitivity to artistic styles. Sensitivity is measured through activation strengths of attention layers in response to style and content representations, and assessing their correlations with features extracted from external models. Based on our findings, we introduce an efficient and lightweight protection strategy, StyleProtect, that achieves effective style defense against fine-tuned diffusion models by updating only selected cross-attention layers. Our experiments utilize a carefully curated artwork dataset based on WikiArt, comprising representative works from 30 artists known for their distinctive and influential styles and cartoon animations from the Anita dataset. The proposed method demonstrates promising performance in safeguarding unique styles of artworks and anime from malicious diffusion customization, while maintaining competitive imperceptibility.

CLFeb 10, 2025
From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models

Mayank Vatsa, Aparna Bharati, Surbhi Mittal et al.

Negation, a linguistic construct conveying absence, denial, or contradiction, poses significant challenges for multilingual multimodal foundation models. These models excel in tasks like machine translation, text-guided generation, image captioning, audio interactions, and video processing but often struggle to accurately interpret negation across diverse languages and cultural contexts. In this perspective paper, we propose a comprehensive taxonomy of negation constructs, illustrating how structural, semantic, and cultural factors influence multimodal foundation models. We present open research questions and highlight key challenges, emphasizing the importance of addressing these issues to achieve robust negation handling. Finally, we advocate for specialized benchmarks, language-specific tokenization, fine-grained attention mechanisms, and advanced multimodal architectures. These strategies can foster more adaptable and semantically precise multimodal foundation models, better equipped to navigate and accurately interpret the complexities of negation in multilingual, multimodal environments.

CVJan 13, 2020
Learning Transformation-Aware Embeddings for Image Forensics

Aparna Bharati, Daniel Moreira, Patrick Flynn et al.

A dramatic rise in the flow of manipulated image content on the Internet has led to an aggressive response from the media forensics research community. New efforts have incorporated increased usage of techniques from computer vision and machine learning to detect and profile the space of image manipulations. This paper addresses Image Provenance Analysis, which aims at discovering relationships among different manipulated image versions that share content. One of the main sub-problems for provenance analysis that has not yet been addressed directly is the edit ordering of images that share full content or are near-duplicates. The existing large networks that generate image descriptors for tasks such as object recognition may not encode the subtle differences between these image covariates. This paper introduces a novel deep learning-based approach to provide a plausible ordering to images that have been generated from a single image through transformations. Our approach learns transformation-aware descriptors using weak supervision via composited transformations and a rank-based quadruplet loss. To establish the efficacy of the proposed approach, comparisons with state-of-the-art handcrafted and deep learning-based descriptors, and image matching approaches are made. Further experimentation validates the proposed approach in the context of image provenance analysis.

CVMar 24, 2019
Dynamic Spatial Verification for Large-Scale Object-Level Image Retrieval

Joel Brogan, Aparna Bharati, Daniel Moreira et al.

Images from social media can reflect diverse viewpoints, heated arguments, and expressions of creativity, adding new complexity to retrieval tasks. Researchers working onContent-Based Image Retrieval (CBIR) have traditionally tuned their algorithms to match filtered results with user search intent. However, we are now bombarded with composite images of unknown origin, authenticity, and even meaning. With such uncertainty, users may not have an initial idea of what the results of a search query should look like. For instance, hidden people, spliced objects, and subtly altered scenes can be difficult for a user to detect initially in a meme image, but may contribute significantly to its composition. We propose a new approach for spatial verification that aims at modeling object-level regions dynamically clustering keypoints in a 2D Hough space, which are then used to accurately weight small contributing objects within the results, without the need for costly object detection steps. We call this method Objects in Scene to Objects in Scene (OS2OS) score, and it is optimized for fast matrix operations on CPUs. OS2OS performs comparably to state-of-the-art methods in classic CBIR problems, on the Oxford5K, Paris 6K, and Google-Landmarks datasets, without the need for bounding boxes. It also succeeds in emerging retrieval tasks such as image composite matching in the NIST MFC2018 dataset and meme-style composite imagery fromReddit.

CVJul 9, 2018
Beyond Pixels: Image Provenance Analysis Leveraging Metadata

Aparna Bharati, Daniel Moreira, Joel Brogan et al.

Creative works, whether paintings or memes, follow unique journeys that result in their final form. Understanding these journeys, a process known as "provenance analysis", provides rich insights into the use, motivation, and authenticity underlying any given work. The application of this type of study to the expanse of unregulated content on the Internet is what we consider in this paper. Provenance analysis provides a snapshot of the chronology and validity of content as it is uploaded, re-uploaded, and modified over time. Although still in its infancy, automated provenance analysis for online multimedia is already being applied to different types of content. Most current works seek to build provenance graphs based on the shared content between images or videos. This can be a computationally expensive task, especially when considering the vast influx of content that the Internet sees every day. Utilizing non-content-based information, such as timestamps, geotags, and camera IDs can help provide important insights into the path a particular image or video has traveled during its time on the Internet without large computational overhead. This paper tests the scope and applicability of metadata-based inferences for provenance graph construction in two different scenarios: digital image forensics and cultural analytics.

CVJul 3, 2018
Getting the subtext without the text: Scalable multimodal sentiment classification from visual and acoustic modalities

Nathaniel Blanchard, Daniel Moreira, Aparna Bharati et al.

In the last decade, video blogs (vlogs) have become an extremely popular method through which people express sentiment. The ubiquitousness of these videos has increased the importance of multimodal fusion models, which incorporate video and audio features with traditional text features for automatic sentiment detection. Multimodal fusion offers a unique opportunity to build models that learn from the full depth of expression available to human viewers. In the detection of sentiment in these videos, acoustic and video features provide clarity to otherwise ambiguous transcripts. In this paper, we present a multimodal fusion model that exclusively uses high-level video and audio features to analyze spoken sentences for sentiment. We discard traditional transcription features in order to minimize human intervention and to maximize the deployability of our model on at-scale real-world data. We select high-level features for our model that have been successful in nonaffect domains in order to test their generalizability in the sentiment detection domain. We train and test our model on the newly released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMUMOSEI) dataset, obtaining an F1 score of 0.8049 on the validation set and an F1 score of 0.6325 on the held-out challenge test set.

CVJan 19, 2018
Image Provenance Analysis at Scale

Daniel Moreira, Aparna Bharati, Joel Brogan et al.

Prior art has shown it is possible to estimate, through image processing and computer vision techniques, the types and parameters of transformations that have been applied to the content of individual images to obtain new images. Given a large corpus of images and a query image, an interesting further step is to retrieve the set of original images whose content is present in the query image, as well as the detailed sequences of transformations that yield the query image given the original images. This is a problem that recently has received the name of image provenance analysis. In these times of public media manipulation ( e.g., fake news and meme sharing), obtaining the history of image transformations is relevant for fact checking and authorship verification, among many other applications. This article presents an end-to-end processing pipeline for image provenance analysis, which works at real-world scale. It employs a cutting-edge image filtering solution that is custom-tailored for the problem at hand, as well as novel techniques for obtaining the provenance graph that expresses how the images, as nodes, are ancestrally connected. A comprehensive set of experiments for each stage of the pipeline is provided, comparing the proposed solution with state-of-the-art results, employing previously published datasets. In addition, this work introduces a new dataset of real-world provenance cases from the social media site Reddit, along with baseline results.

CVSep 22, 2017
Demography-based Facial Retouching Detection using Subclass Supervised Sparse Autoencoder

Aparna Bharati, Mayank Vatsa, Richa Singh et al.

Digital retouching of face images is becoming more widespread due to the introduction of software packages that automate the task. Several researchers have introduced algorithms to detect whether a face image is original or retouched. However, previous work on this topic has not considered whether or how accuracy of retouching detection varies with the demography of face images. In this paper, we introduce a new Multi-Demographic Retouched Faces (MDRF) dataset, which contains images belonging to two genders, male and female, and three ethnicities, Indian, Chinese, and Caucasian. Further, retouched images are created using two different retouching software packages. The second major contribution of this research is a novel semi-supervised autoencoder incorporating "subclass" information to improve classification. The proposed approach outperforms existing state-of-the-art detection algorithms for the task of generalized retouching detection. Experiments conducted with multiple combinations of ethnicities show that accuracy of retouching detection can vary greatly based on the demographics of the training and testing images.

IRJun 1, 2017
Provenance Filtering for Multimedia Phylogeny

Allan Pinto, Daniel Moreira, Aparna Bharati et al.

Departing from traditional digital forensics modeling, which seeks to analyze single objects in isolation, multimedia phylogeny analyzes the evolutionary processes that influence digital objects and collections over time. One of its integral pieces is provenance filtering, which consists of searching a potentially large pool of objects for the most related ones with respect to a given query, in terms of possible ancestors (donors or contributors) and descendants. In this paper, we propose a two-tiered provenance filtering approach to find all the potential images that might have contributed to the creation process of a given query $q$. In our solution, the first (coarse) tier aims to find the most likely "host" images --- the major donor or background --- contributing to a composite/doctored image. The search is then refined in the second tier, in which we search for more specific (potentially small) parts of the query that might have been extracted from other images and spliced into the query image. Experimental results with a dataset containing more than a million images show that the two-tiered solution underpinned by the context of the query is highly useful for solving this difficult task.

CVMay 31, 2017
U-Phylogeny: Undirected Provenance Graph Construction in the Wild

Aparna Bharati, Daniel Moreira, Allan Pinto et al.

Deriving relationships between images and tracing back their history of modifications are at the core of Multimedia Phylogeny solutions, which aim to combat misinformation through doctored visual media. Nonetheless, most recent image phylogeny solutions cannot properly address cases of forged composite images with multiple donors, an area known as multiple parenting phylogeny (MPP). This paper presents a preliminary undirected graph construction solution for MPP, without any strict assumptions. The algorithm is underpinned by robust image representative keypoints and different geometric consistency checks among matching regions in both images to provide regions of interest for direct comparison. The paper introduces a novel technique to geometrically filter the most promising matches as well as to aid in the shared region localization task. The strength of the approach is corroborated by experiments with real-world cases, with and without image distractors (unrelated cases).

CVMay 1, 2017
Spotting the Difference: Context Retrieval and Analysis for Improved Forgery Detection and Localization

Joel Brogan, Paolo Bestagini, Aparna Bharati et al.

As image tampering becomes ever more sophisticated and commonplace, the need for image forensics algorithms that can accurately and quickly detect forgeries grows. In this paper, we revisit the ideas of image querying and retrieval to provide clues to better localize forgeries. We propose a method to perform large-scale image forensics on the order of one million images using the help of an image search algorithm and database to gather contextual clues as to where tampering may have taken place. In this vein, we introduce five new strongly invariant image comparison methods and test their effectiveness under heavy noise, rotation, and color space changes. Lastly, we show the effectiveness of these methods compared to passive image forensics using Nimble [https://www.nist.gov/itl/iad/mig/nimble-challenge], a new, state-of-the-art dataset from the National Institute of Standards and Technology (NIST).

CVOct 16, 2016
To Frontalize or Not To Frontalize: Do We Really Need Elaborate Pre-processing To Improve Face Recognition?

Sandipan Banerjee, Joel Brogan, Janez Krizaj et al.

Face recognition performance has improved remarkably in the last decade. Much of this success can be attributed to the development of deep learning techniques such as convolutional neural networks (CNNs). While CNNs have pushed the state-of-the-art forward, their training process requires a large amount of clean and correctly labelled training data. If a CNN is intended to tolerate facial pose, then we face an important question: should this training data be diverse in its pose distribution, or should face images be normalized to a single pose in a pre-processing step? To address this question, we evaluate a number of popular facial landmarking and pose correction algorithms to understand their effect on facial recognition performance. Additionally, we introduce a new, automatic, single-image frontalization scheme that exceeds the performance of current algorithms. CNNs trained using sets of different pre-processing methods are used to extract features from the Point and Shoot Challenge (PaSC) and CMU Multi-PIE datasets. We assert that the subsequent verification and recognition performance serves to quantify the effectiveness of each pose correction scheme.