CVJun 9, 2022
ReFace: Real-time Adversarial Attacks on Face Recognition SystemsShehzeen Hussain, Todd Huster, Chris Mesterharm et al. · gatech, harvard
Deep neural network based face recognition models have been shown to be vulnerable to adversarial examples. However, many of the past attacks require the adversary to solve an input-dependent optimization problem using gradient descent which makes the attack impractical in real-time. These adversarial examples are also tightly coupled to the attacked model and are not as successful in transferring to different models. In this work, we propose ReFace, a real-time, highly-transferable attack on face recognition models based on Adversarial Transformation Networks (ATNs). ATNs model adversarial example generation as a feed-forward neural network. We find that the white-box attack success rate of a pure U-Net ATN falls substantially short of gradient-based attacks like PGD on large face recognition datasets. We therefore propose a new architecture for ATNs that closes this gap while maintaining a 10000x speedup over PGD. Furthermore, we find that at a given perturbation magnitude, our ATN adversarial perturbations are more effective in transferring to new face recognition models than PGD. ReFace attacks can successfully deceive commercial face recognition services in a transfer attack setting and reduce face identification accuracy from 82% to 16.4% for AWS SearchFaces API and Azure face verification accuracy from 91% to 50.1%.
HCJun 8, 2022
Explanation as Question Answering based on a Task Model of the Agent's DesignAshok Goel, Harshvardhan Sikka, Vrinda Nandan et al.
We describe a stance towards the generation of explanations in AI agents that is both human-centered and design-based. We collect questions about the working of an AI agent through participatory design by focus groups. We capture an agent's design through a Task-Method-Knowledge model that explicitly specifies the agent's tasks and goals, as well as the mechanisms, knowledge and vocabulary it uses for accomplishing the tasks. We illustrate our approach through the generation of explanations in Skillsync, an AI agent that links companies and colleges for worker upskilling and reskilling. In particular, we embed a question-answering agent called AskJill in Skillsync, where AskJill contains a TMK model of Skillsync's design. AskJill presently answers human-generated questions about Skillsync's tasks and vocabulary, and thereby helps explain how it produces its recommendations.
HCAug 1, 2023
Designing a Communication Bridge between Communities: Participatory Design for a Question-Answering AI AgentJeonghyun Lee, Vrinda Nandan, Harshvardhan Sikka et al.
How do we design an AI system that is intended to act as a communication bridge between two user communities with different mental models and vocabularies? Skillsync is an interactive environment that engages employers (companies) and training providers (colleges) in a sustained dialogue to help them achieve the goal of building a training proposal that successfully meets the needs of the employers and employees. We used a variation of participatory design to elicit requirements for developing AskJill, a question-answering agent that explains how Skillsync works and thus acts as a communication bridge between company and college users. Our study finds that participatory design was useful in guiding the requirements gathering and eliciting user questions for the development of AskJill. Our results also suggest that the two Skillsync user communities perceived glossary assistance as a key feature that AskJill needs to offer, and they would benefit from such a shared vocabulary.
AIApr 21, 2022
A Framework for Interactive Knowledge-Aided Machine TeachingKaran Taneja, Harshvardhan Sikka, Ashok Goel
Machine Teaching (MT) is an interactive process where humans train a machine learning model by playing the role of a teacher. The process of designing an MT system involves decisions that can impact both efficiency of human teachers and performance of machine learners. Previous research has proposed and evaluated specific MT systems but there is limited discussion on a general framework for designing them. We propose a framework for designing MT systems and also detail a system for the text classification problem as a specific instance. Our framework focuses on three components i.e. teaching interface, machine learner, and knowledge base; and their relations describe how each component can benefit the others. Our preliminary experiments show how MT systems can reduce both human teaching time and machine learner error rate.
HCJun 10, 2022
Human-AI Interaction Design in Machine TeachingKaran Taneja, Harshvardhan Sikka, Ashok Goel
Machine Teaching (MT) is an interactive process where a human and a machine interact with the goal of training a machine learning model (ML) for a specified task. The human teacher communicates their task expertise and the machine student gathers the required data and knowledge to produce an ML model. MT systems are developed to jointly minimize the time spent on teaching and the learner's error rate. The design of human-AI interaction in an MT system not only impacts the teaching efficiency, but also indirectly influences the ML performance by affecting the teaching quality. In this paper, we build upon our previous work where we proposed an MT framework with three components, viz., the teaching interface, the machine learner, and the knowledge base, and focus on the human-AI interaction design involved in realizing the teaching interface. We outline design decisions that need to be addressed in developing an MT system beginning from an ML task. The paper follows the Socratic method entailing a dialogue between a curious student and a wise teacher.
80.4LGApr 15
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding ModelsYangyue Wang, Harshvardhan Sikka, Yash Mathur et al.
GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27-56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We introduce GUI-Perturbed, a controlled perturbation framework that independently varies visual scenes and instructions to measure grounding robustness. Evaluating three 7B models from the same architecture lineage, we find that relational instructions cause systematic accuracy collapse across all models, a 70% browser zoom produces statistically significant degradation, and rank-8 LoRA fine-tuning with augmented data degrades performance rather than improving it. By perturbing along independent axes, GUI-Perturbed isolates which specific capability axes are affected-spatial reasoning, visual robustness, reasoning calibration-providing diagnostic signal that aggregate benchmarks cannot. We release the dataset, augmentation pipeline, and a fine-tuned model.
LGJun 10, 2025Code
An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action ModelsPranav Guruprasad, Yangyue Wang, Sudipta Chowdhury et al. · gatech, harvard
Recent innovations in multimodal action models represent a promising direction for developing general-purpose agentic systems, combining visual understanding, language comprehension, and action generation. We introduce MultiNet - a novel, fully open-source benchmark and surrounding software ecosystem designed to rigorously evaluate and adapt models across vision, language, and action domains. We establish standardized evaluation protocols for assessing vision-language models (VLMs) and vision-language-action models (VLAs), and provide open source software to download relevant data, models, and evaluations. Additionally, we provide a composite dataset with over 1.3 trillion tokens of image captioning, visual question answering, commonsense reasoning, robotic control, digital game-play, simulated locomotion/manipulation, and many more tasks. The MultiNet benchmark, framework, toolkit, and evaluation harness have been used in downstream research on the limitations of VLA generalization.
RONov 4, 2024
Benchmarking Vision, Language, & Action Models on Robotic Learning TasksPranav Guruprasad, Harshvardhan Sikka, Jaewoo Song et al. · gatech, harvard
Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems, demonstrating the ability to combine visual understanding, language comprehension, and action generation. However, systematic evaluation of these models across diverse robotic tasks remains limited. In this work, we present a comprehensive evaluation framework and benchmark suite for assessing VLA models. We profile three state-of-the-art VLM and VLAs - GPT-4o, OpenVLA, and JAT - across 20 diverse datasets from the Open-X-Embodiment collection, evaluating their performance on various manipulation tasks. Our analysis reveals several key insights: 1. current VLA models show significant variation in performance across different tasks and robot platforms, with GPT-4o demonstrating the most consistent performance through sophisticated prompt engineering, 2. all models struggle with complex manipulation tasks requiring multi-step planning, and 3. model performance is notably sensitive to action space characteristics and environmental factors. We release our evaluation framework and findings to facilitate systematic assessment of future VLA models and identify critical areas for improvement in the development of general purpose robotic systems.
CVMay 8, 2025
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action EnvironmentsPranav Guruprasad, Yangyue Wang, Sudipta Chowdhury et al. · gatech, harvard
Vision-language-action (VLA) models represent an important step toward general-purpose robotic systems by integrating visual perception, language understanding, and action execution. However, systematic evaluation of these models, particularly their zero-shot generalization capabilities in procedurally out-of-distribution (OOD) environments, remains limited. In this paper, we introduce MultiNet v0.2, a comprehensive benchmark designed to evaluate and analyze the generalization performance of state-of-the-art VLMs and VLAs - including GPT-4o, GPT-4.1, OpenVLA, Pi0 Base, and Pi0 FAST - on diverse procedural tasks from the Procgen benchmark. Our analysis reveals several critical insights: (1) all evaluated models exhibit significant limitations in zero-shot generalization to OOD tasks, with performance heavily influenced by factors such as action representation and task complexity; (2) VLAs generally outperforms other models due to their robust architectural design; and (3) VLM variants demonstrate substantial improvements when constrained appropriately, highlighting the sensitivity of model performance to precise prompt engineering. We release our benchmark, evaluation framework, and findings to enable the assessment of future VLA models and identify critical areas for improvement in their application to out-of-distribution digital tasks.
CYMay 8, 2025
A4L: An Architecture for AI-Augmented LearningAshok Goel, Ploy Thajchayapong, Vrinda Nandan et al.
AI promises personalized learning and scalable education. As AI agents increasingly permeate education in support of teaching and learning, there is a critical and urgent need for data architectures for collecting and analyzing data on learning, and feeding the results back to teachers, learners, and the AI agents for personalization of learning at scale. At the National AI Institute for Adult Learning and Online Education, we are developing an Architecture for AI-Augmented Learning (A4L) for supporting adult learning through online education. We present the motivations, goals, requirements of the A4L architecture. We describe preliminary applications of A4L and discuss how it advances the goals of making learning more personalized and scalable.
LGDec 22, 2021
Agent Smith: Teaching Question Answering to Jill WatsonAshok Goel, Harshvardhan Sikka, Eric Gregori
Building AI agents can be costly. Consider a question answering agent such as Jill Watson that automatically answers students' questions on the discussion forums of online classes based on their syllabi and other course materials. Training a Jill on the syllabus of a new online class can take a hundred hours or more. Machine teaching - interactive teaching of an AI agent using synthetic data sets - can reduce the training time because it combines the advantages of knowledge-based AI, machine learning using large data sets, and interactive human-in-loop training. We describe Agent Smith, an interactive machine teaching agent that reduces the time taken to train a Jill for a new online class by an order of magnitude.
LGJul 7, 2021
WeightScale: Interpreting Weight Change in Neural NetworksAyush Manish Agrawal, Atharva Tendle, Harshvardhan Sikka et al.
Interpreting the learning dynamics of neural networks can provide useful insights into how networks learn and the development of better training and design approaches. We present an approach to interpret learning in neural networks by measuring relative weight change on a per layer basis and dynamically aggregating emerging trends through combination of dimensionality reduction and clustering which allows us to scale to very deep networks. We use this approach to investigate learning in the context of vision tasks across a variety of state-of-the-art networks and provide insights into the learning behavior of these networks, including how task complexity affects layer-wise learning in deeper layers of networks.
LGNov 13, 2020
Investigating Learning in Deep Neural Networks using Layer-Wise Weight ChangeAyush Manish Agrawal, Atharva Tendle, Harshvardhan Sikka et al.
Understanding the per-layer learning dynamics of deep neural networks is of significant interest as it may provide insights into how neural networks learn and the potential for better training regimens. We investigate learning in Deep Convolutional Neural Networks (CNNs) by measuring the relative weight change of layers while training. Several interesting trends emerge in a variety of CNN architectures across various computer vision classification tasks, including the overall increase in relative weight change of later layers as compared to earlier ones.
NEOct 27, 2020
A Genetic Algorithm Based Approach for Satellite AutonomySidhdharth Sikka, Harshvardhan Sikka
Autonomous spacecraft maneuver planning using an evolutionary algorithmic approach is investigated. Simulated spacecraft were placed into four different initial orbits. Each was allowed a string of thirty delta-v impulse maneuvers in six cartesian directions, the positive and negative x, y and z directions. The goal of the spacecraft maneuver string was to, starting from some non-polar starting orbit, place the spacecraft into a polar, low eccentricity orbit. A genetic algorithm was implemented, using a mating, fitness, mutation and crossover scheme for impulse strings. The genetic algorithm was successfully able to produce this result for all the starting orbits. Performance and future work is also discussed.
LGMay 27, 2020
Benchmarking Differentially Private Residual Networks for Medical ImagerySahib Singh, Harshvardhan Sikka, Sasikanth Kotti et al.
In this paper we measure the effectiveness of $ε$-Differential Privacy (DP) when applied to medical imaging. We compare two robust differential privacy mechanisms: Local-DP and DP-SGD and benchmark their performance when analyzing medical imagery records. We analyze the trade-off between the model's accuracy and the level of privacy it guarantees, and also take a closer look to evaluate how useful these theoretical privacy guarantees actually prove to be in the real world medical setting.
LGApr 25, 2020
A Deeper Look at the Unsupervised Learning of Disentangled Representations in $β$-VAE from the Perspective of Core Object RecognitionHarshvardhan Sikka
The ability to recognize objects despite there being differences in appearance, known as Core Object Recognition, forms a critical part of human perception. While it is understood that the brain accomplishes Core Object Recognition through feedforward, hierarchical computations through the visual stream, the underlying algorithms that allow for invariant representations to form downstream is still not well understood. (DiCarlo et al., 2012) Various computational perceptual models have been built to attempt and tackle the object identification task in an artificial perceptual setting. Artificial Neural Networks, computational graphs consisting of weighted edges and mathematical operations at vertices, are loosely inspired by neural networks in the brain and have proven effective at various visual perceptual tasks, including object characterization and identification. (Pinto et al., 2008) (DiCarlo et al., 2012) For many data analysis tasks, learning representations where each dimension is statistically independent and thus disentangled from the others is useful. If the underlying generative factors of the data are also statistically independent, Bayesian inference of latent variables can form disentangled representations. This thesis constitutes a research project exploring a generalization of the Variational Autoencoder (VAE), $β$-VAE, that aims to learn disentangled representations using variational inference. $β$-VAE incorporates the hyperparameter $β$, and enforces conditional independence of its bottleneck neurons, which is in general not compatible with the statistical independence of latent variables. This text examines this architecture, and provides analytical and numerical arguments, with the goal of demonstrating that this incompatibility leads to a non-monotonic inference performance in $β$-VAE with a finite optimal $β$.
MLDec 11, 2019
A Closer Look at Disentangling in $β$-VAEHarshvardhan Sikka, Weishun Zhong, Jun Yin et al.
In many data analysis tasks, it is beneficial to learn representations where each dimension is statistically independent and thus disentangled from the others. If data generating factors are also statistically independent, disentangled representations can be formed by Bayesian inference of latent variables. We examine a generalization of the Variational Autoencoder (VAE), $β$-VAE, for learning such representations using variational inference. $β$-VAE enforces conditional independence of its bottleneck neurons controlled by its hyperparameter $β$. This condition is in general not compatible with the statistical independence of latents. By providing analytical and numerical arguments, we show that this incompatibility leads to a non-monotonic inference performance in $β$-VAE with a finite optimal $β$.