Anubhav Jain

h-index69

15papers

363citations

Novelty50%

AI Score38

Ranked #84,789 of 194,257 authors (top 44%)#28,584 in CV (top 48%)

15 Papers

10.2CLDec 10, 2022

Structured information extraction from complex scientific text with fine-tuned large language models

Alexander Dunn, John Dagdelen, Nicholas Walker et al. · princeton

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.

17.0MTRL-SCIAug 28, 2023Code

Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions

Janosh Riebesell, Rhys E. A. Goodall, Philipp Benner et al.

The rapid adoption of machine learning (ML) in domain sciences necessitates best practices and standardized benchmarking for performance evaluation. We present Matbench Discovery, an evaluation framework for ML energy models, applied as pre-filters for high-throughput searches of stable inorganic crystals. This framework addresses the disconnect between thermodynamic stability and formation energy, as well as retrospective vs. prospective benchmarking in materials discovery. We release a Python package to support model submissions and maintain an online leaderboard, offering insights into performance trade-offs. To identify the best-performing ML methodologies for materials discovery, we benchmarked various approaches, including random forests, graph neural networks (GNNs), one-shot predictors, iterative Bayesian optimizers, and universal interatomic potentials (UIP). Our initial results rank models by test set F1 scores for thermodynamic stability prediction: EquiformerV2 + DeNS > Orb > SevenNet > MACE > CHGNet > M3GNet > ALIGNN > MEGNet > CGCNN > CGCNN+P > Wrenformer > BOWSR > Voronoi fingerprint random forest. UIPs emerge as the top performers, achieving F1 scores of 0.57-0.82 and discovery acceleration factors (DAF) of up to 6x on the first 10k stable predictions compared to random selection. We also identify a misalignment between regression metrics and task-relevant classification metrics. Accurate regressors can yield high false-positive rates near the decision boundary at 0 eV/atom above the convex hull. Our results demonstrate UIPs' ability to optimize computational budget allocation for expanding materials databases. However, their limitations remain underexplored in traditional benchmarks. We advocate for task-based evaluation frameworks, as implemented here, to address these limitations and advance ML-guided materials discovery.

2.3APP-PHApr 26, 2023

Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3

Nicholas Walker, John Dagdelen, Kevin Cruse et al.

Although gold nanorods have been the subject of much research, the pathways for controlling their shape and thereby their optical properties remain largely heuristically understood. Although it is apparent that the simultaneous presence of and interaction between various reagents during synthesis control these properties, computational and experimental approaches for exploring the synthesis space can be either intractable or too time-consuming in practice. This motivates an alternative approach leveraging the wealth of synthesis information already embedded in the body of scientific literature by developing tools to extract relevant structured data in an automated, high-throughput manner. To that end, we present an approach using the powerful GPT-3 language model to extract structured multi-step seed-mediated growth procedures and outcomes for gold nanorods from unstructured scientific text. GPT-3 prompt completions are fine-tuned to predict synthesis templates in the form of JSON documents from unstructured text input with an overall accuracy of $86\%$. The performance is notable, considering the model is performing simultaneous entity recognition and relation extraction. We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.

9.4SDApr 24, 2022Code

Dictionary Attacks on Speaker Verification

Mirko Marras, Pawel Korus, Anubhav Jain et al.

In this paper, we propose dictionary attacks against speaker verification - a novel attack vector that aims to match a large fraction of speaker population by chance. We introduce a generic formulation of the attack that can be used with various speech representations and threat models. The attacker uses adversarial optimization to maximize raw similarity of speaker embeddings between a seed speech sample and a proxy population. The resulting master voice successfully matches a non-trivial fraction of people in an unknown population. Adversarial waveforms obtained with our approach can match on average 69% of females and 38% of males enrolled in the target system at a strict decision threshold calibrated to yield false alarm rate of 1%. By using the attack with a black-box voice cloning system, we obtain master voices that are effective in the most challenging conditions and transferable between speaker encoders. We also show that, combined with multiple attempts, this attack opens even more to serious issues on the security of these systems.

6.5CVDec 5, 2022Code

A Dataless FaceSwap Detection Approach Using Synthetic Images

Anubhav Jain, Nasir Memon, Julian Togelius

Face swapping technology used to create "Deepfakes" has advanced significantly over the past few years and now enables us to create realistic facial manipulations. Current deep learning algorithms to detect deepfakes have shown promising results, however, they require large amounts of training data, and as we show they are biased towards a particular ethnicity. We propose a deepfake detection methodology that eliminates the need for any real data by making use of synthetically generated data using StyleGAN3. This not only performs at par with the traditional training methodology of using real data but it shows better generalization capabilities when finetuned with a small amount of real data. Furthermore, this also reduces biases created by facial image datasets that might have sparse data from particular ethnicities.

1.5CVAug 16, 2023

Fair GANs through model rebalancing for extremely imbalanced class distributions

Anubhav Jain, Nasir Memon, Julian Togelius

Deep generative models require large amounts of training data. This often poses a problem as the collection of datasets can be expensive and difficult, in particular datasets that are representative of the appropriate underlying distribution (e.g. demographic). This introduces biases in datasets which are further propagated in the models. We present an approach to construct an unbiased generative adversarial network (GAN) from an existing biased GAN by rebalancing the model distribution. We do so by generating balanced data from an existing imbalanced deep generative model using an evolutionary algorithm and then using this data to train a balanced generative model. Additionally, we propose a bias mitigation loss function that minimizes the deviation of the learned class distribution from being equiprobable. We show results for the StyleGAN2 models while training on the Flickr Faces High Quality (FFHQ) dataset for racial fairness and see that the proposed approach improves on the fairness metric by almost 5 times, whilst maintaining image quality. We further validate our approach by applying it to an imbalanced CIFAR10 dataset where we show that we can obtain comparable fairness and image quality as when training on a balanced CIFAR10 dataset which is also twice as large. Lastly, we argue that the traditionally used image quality metrics such as Frechet inception distance (FID) are unsuitable for scenarios where the class distributions are imbalanced and a balanced reference set is not available.

3.7CVNov 20, 2023

Alpha-wolves and Alpha-mammals: Exploring Dictionary Attacks on Iris Recognition Systems

Sudipta Banerjee, Anubhav Jain, Zehua Jiang et al.

A dictionary attack in a biometric system entails the use of a small number of strategically generated images or templates to successfully match with a large number of identities, thereby compromising security. We focus on dictionary attacks at the template level, specifically the IrisCodes used in iris recognition systems. We present an hitherto unknown vulnerability wherein we mix IrisCodes using simple bitwise operators to generate alpha-mixtures - alpha-wolves (combining a set of "wolf" samples) and alpha-mammals (combining a set of users selected via search optimization) that increase false matches. We evaluate this vulnerability using the IITD, CASIA-IrisV4-Thousand and Synthetic datasets, and observe that an alpha-wolf (from two wolves) can match upto 71 identities @FMR=0.001%, while an alpha-mammal (from two identities) can match upto 133 other identities @FMR=0.01% on the IITD dataset.

3.6CVApr 8, 2025Code

FaceCloak: Learning to Protect Face Templates

Sudipta Banerjee, Anubhav Jain, Chinmay Hegde et al.

Generative models can reconstruct face images from encoded representations (templates) bearing remarkable likeness to the original face, raising security and privacy concerns. We present \textsc{FaceCloak}, a neural network framework that protects face templates by generating smart, renewable binary cloaks. Our method proactively thwarts inversion attacks by cloaking face templates with unique disruptors synthesized from a single face template on the fly while provably retaining biometric utility and unlinkability. Our cloaked templates can suppress sensitive attributes while generalizing to novel feature extraction schemes and outperform leading baselines in terms of biometric matching and resiliency to reconstruction attacks. \textsc{FaceCloak}-based matching is extremely fast (inference time =0.28 ms) and light (0.57 MB). We have released our \href{https://github.com/sudban3089/FaceCloak.git}{code} for reproducible research.

15.3CVDec 10, 2024Code

TraSCE: Trajectory Steering for Concept Erasure

Anubhav Jain, Yuya Kobayashi, Takashi Shibuya et al.

Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content, including ones proposed by red teams, and erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (either image or prompt), making it easier for model owners to erase new concepts.

21.1CVApr 27, 2025Code

Forging and Removing Latent-Noise Diffusion Watermarks Using a Single Image

Anubhav Jain, Yuya Kobayashi, Naoki Murata et al.

Watermarking techniques are vital for protecting intellectual property and preventing fraudulent use of media. Most previous watermarking schemes designed for diffusion models embed a secret key in the initial noise. The resulting pattern is often considered hard to remove and forge into unrelated images. In this paper, we propose a black-box adversarial attack without presuming access to the diffusion model weights. Our attack uses only a single watermarked example and is based on a simple observation: there is a many-to-one mapping between images and initial noises. There are regions in the clean image latent space pertaining to each watermark that get mapped to the same initial noise when inverted. Based on this intuition, we propose an adversarial attack to forge the watermark by introducing perturbations to the images such that we can enter the region of watermarked images. We show that we can also apply a similar approach for watermark removal by learning perturbations to exit this region. We report results on multiple watermarking schemes (Tree-Ring, RingID, WIND, and Gaussian Shading) across two diffusion models (SDv1.4 and SDv2.0). Our results demonstrate the effectiveness of the attack and expose vulnerabilities in the watermarking methods, motivating future research on improving them.

14.1CVNov 23, 2024Code

Classifier-Free Guidance inside the Attraction Basin May Cause Memorization

Anubhav Jain, Yuya Kobayashi, Takashi Shibuya et al.

Diffusion models are prone to exactly reproduce images from the training data. This exact reproduction of the training data is concerning as it can lead to copyright infringement and/or leakage of privacy-sensitive information. In this paper, we present a novel perspective on the memorization phenomenon and propose a simple yet effective approach to mitigate it. We argue that memorization occurs because of an attraction basin in the denoising process which steers the diffusion trajectory towards a memorized image. However, this can be mitigated by guiding the diffusion trajectory away from the attraction basin by not applying classifier-free guidance until an ideal transition point occurs from which classifier-free guidance is applied. This leads to the generation of non-memorized images that are high in image quality and well-aligned with the conditioning mechanism. To further improve on this, we present a new guidance technique, opposite guidance, that escapes the attraction basin sooner in the denoising process. We demonstrate the existence of attraction basins in various scenarios in which memorization occurs, and we show that our proposed approach successfully mitigates memorization.

6.8CVMay 12, 2023Code

Zero-shot racially balanced dataset generation using an existing biased StyleGAN2

Anubhav Jain, Nasir Memon, Julian Togelius

Facial recognition systems have made significant strides thanks to data-heavy deep learning models, but these models rely on large privacy-sensitive datasets. Further, many of these datasets lack diversity in terms of ethnicity and demographics, which can lead to biased models that can have serious societal and security implications. To address these issues, we propose a methodology that leverages the biased generative model StyleGAN2 to create demographically diverse images of synthetic individuals. The synthetic dataset is created using a novel evolutionary search algorithm that targets specific demographic groups. By training face recognition models with the resulting balanced dataset containing 50,000 identities per race (13.5 million images in total), we can improve their performance and minimize biases that might have been present in a model trained on a real dataset.

2.3OHOct 20, 2019

Benchmark Dataset for Timetable Optimization of Bus Routes in the City of New Delhi

Anubhav Jain, Avdesh Kumar, Saumya Balodi et al.

Public transport is one of the major forms of transportation in the world. This makes it vital to ensure that public transport is efficient. This research presents a novel real-time GPS bus transit data for over 500 routes of buses operating in New Delhi. The data can be used for modeling various timetable optimization tasks as well as in other domains such as traffic management, travel time estimation, etc. The paper also presents an approach to reduce the waiting time of Delhi buses by analyzing the traffic behavior and proposing a timetable. This algorithm serves as a benchmark for the dataset. The algorithm uses a constrained clustering algorithm for classification of trips. It further analyses the data statistically to provide a timetable which is efficient in learning the inter- and intra-month variations.

9.8CVJan 26, 2019

On Detecting GANs and Retouching based Synthetic Alterations

Anubhav Jain, Richa Singh, Mayank Vatsa

Digitally retouching images has become a popular trend, with people posting altered images on social media and even magazines posting flawless facial images of celebrities. Further, with advancements in Generative Adversarial Networks (GANs), now changing attributes and retouching have become very easy. Such synthetic alterations have adverse effect on face recognition algorithms. While researchers have proposed to detect image tampering, detecting GANs generated images has still not been explored. This paper proposes a supervised deep learning algorithm using Convolutional Neural Networks (CNNs) to detect synthetically altered images. The algorithm yields an accuracy of 99.65% on detecting retouching on the ND-IIITD dataset. It outperforms the previous state of the art which reported an accuracy of 87% on the database. For distinguishing between real images and images generated using GANs, the proposed algorithm yields an accuracy of 99.83%.

3.8SEOct 16, 2015

A Community Contribution Framework for Sharing Materials Data with Materials Project

Patrick Huck, Anubhav Jain, Dan Gunter et al.

As scientific discovery becomes increasingly data-driven, software platforms are needed to efficiently organize and disseminate data from disparate sources. This is certainly the case in the field of materials science. For example, Materials Project has generated computational data on over 60,000 chemical compounds and has made that data available through a web portal and REST interface. However, such portals must seek to incorporate community submissions to expand the scope of scientific data sharing. In this paper, we describe MPContribs, a computing/software infrastructure to integrate and organize contributions of simulated or measured materials data from users. Our solution supports complex submissions and provides interfaces that allow contributors to share analyses and graphs. A RESTful API exposes mechanisms for book-keeping, retrieval and aggregation of submitted entries, as well as persistent URIs or DOIs that can be used to reference the data in publications. Our approach isolates contributed data from a host project's quality-controlled core data and yet enables analyses across the entire dataset, programmatically or through customized web apps. We expect the developed framework to enhance collaborative determination of material properties and to maximize the impact of each contributor's dataset. In the long-term, MPContribs seeks to make Materials Project an institutional, and thus community-wide, memory for computational and experimental materials science.