Xuwang Yin

LG
h-index: 58
Papers: 20
Citations: 4,871
Novelty: 52%
AI Score: 44

20 Papers

LG · Oct 2, 2023
Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen et al. · Berkeley, CMU

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

SP · Apr 30, 2022 · Code
End-to-End Signal Classification in Signed Cumulative Distribution Transform Space

Abu Hasnat Mohammad Rubaiyat, Shiying Li, Xuwang Yin et al.

This paper presents a new end-to-end signal classification method using the signed cumulative distribution transform (SCDT). We adopt a transport-based generative model to define the classification problem. We then make use of mathematical properties of the SCDT to render the problem easier in the transform domain, and solve for the class of an unknown sample using a nearest local subspace (NLS) search algorithm in the SCDT domain. Experiments show that the proposed method provides high-accuracy classification results while being data efficient, robust to out-of-distribution samples, and competitive in computational complexity with deep learning end-to-end classification methods. A Python implementation of the proposed method is integrated into the software package PyTransKit (https://github.com/rohdelab/PyTransKit).

LG · Jul 31, 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Richard Ren, Steven Basart, Adam Khoja et al.

As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with both upstream model capabilities and training compute, potentially enabling "safetywashing"--where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
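
The correlation analysis described above can be illustrated with a minimal sketch. It assumes a rank (Spearman) correlation between per-model capability scores and per-model safety benchmark scores; the choice of Spearman, the function names, and all numbers below are illustrative assumptions, not values or code from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied run
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical scores for five models (made-up numbers for illustration).
capabilities = [0.31, 0.45, 0.52, 0.68, 0.80]
benchmark_a = [0.40, 0.47, 0.55, 0.70, 0.90]  # rises with capabilities
benchmark_b = [0.62, 0.35, 0.71, 0.30, 0.55]  # largely unrelated

rho_a = spearman(capabilities, benchmark_a)
rho_b = spearman(capabilities, benchmark_b)
```

Under this reading, a benchmark like `benchmark_a`, whose ranking of models mirrors the capability ranking, would be a candidate for "safetywashing", while one like `benchmark_b` measures something more separable from capabilities.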

LG · Feb 6, 2024 · Code
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin et al. · Berkeley, CMU

Automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (LLMs), yet the field lacks a standardized evaluation framework to rigorously assess new methods. To address this issue, we introduce HarmBench, a standardized evaluation framework for automated red teaming. We identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design HarmBench to meet these criteria. Using HarmBench, we conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, yielding novel insights. We also introduce a highly efficient adversarial training method that greatly enhances LLM robustness across a wide range of attacks, demonstrating how HarmBench enables codevelopment of attacks and defenses. We open source HarmBench at https://github.com/centerforaisafety/HarmBench.

LG · Dec 14, 2022
Generative Robust Classification

Xuwang Yin

Training an adversarially robust discriminative (i.e., softmax) classifier has been the dominant approach to robust classification. Building on recent work on adversarial training (AT)-based generative models, we investigate using AT to learn unnormalized class-conditional density models and then performing generative robust classification. Our results show that, under the condition of similar model capacities, the generative robust classifier achieves comparable performance to a baseline softmax robust classifier when the test data is clean or when the test perturbation is of limited size, and much better performance when the test perturbation size exceeds the training perturbation size. The generative classifier is also able to generate samples or counterfactuals that more closely resemble the training data, suggesting that the generative classifier can better capture the class-conditional distributions. In contrast to standard discriminative adversarial training, where advanced data augmentation techniques are only effective when combined with weight averaging, we find it straightforward to apply advanced data augmentation to achieve better robustness in our approach. Our results suggest that the generative classifier is a competitive alternative for robust classification, especially for problems with a limited number of classes.

CV · Jan 9, 2022 · Code
Invariance encoding in sliced-Wasserstein space for image classification with limited training data

Mohammad Shifat E Rabbi, Yan Zhuang, Shiying Li et al.

Deep convolutional neural networks (CNNs) are broadly considered to be state-of-the-art generic end-to-end image classification systems. However, they are known to underperform when training data are limited and thus require data augmentation strategies that render the method computationally expensive and not always effective. Rather than using a data augmentation strategy to encode invariances as typically done in machine learning, here we propose to mathematically augment a nearest subspace classification model in sliced-Wasserstein space by exploiting certain mathematical properties of the Radon Cumulative Distribution Transform (R-CDT), a recently introduced image transform. We demonstrate that for a particular type of learning problem, our mathematical solution has advantages over data augmentation with deep CNNs in terms of classification accuracy and computational complexity, and is particularly effective under a limited training data setting. The method is simple, effective, computationally efficient, non-iterative, and requires no parameters to be tuned. Python code implementing our method is available at https://github.com/rohdelab/mathematical_augmentation. Our method is integrated as a part of the software package PyTransKit, which is available at https://github.com/rohdelab/PyTransKit.

CV · Apr 7, 2020 · Code
Radon cumulative distribution transform subspace modeling for image classification

Mohammad Shifat-E-Rabbi, Xuwang Yin, Abu Hasnat Mohammad Rubaiyat et al.

We present a new supervised image classification method applicable to a broad class of image deformation models. The method makes use of the previously described Radon Cumulative Distribution Transform (R-CDT) for image data, whose mathematical properties are exploited to express the image data in a form that is more suitable for machine learning. While certain operations such as translation, scaling, and higher-order transformations are challenging to model in native image space, we show the R-CDT can capture some of these variations and thus render the associated image classification problems easier to solve. The method -- utilizing a nearest-subspace algorithm in R-CDT space -- is simple to implement, non-iterative, has no hyper-parameters to tune, is computationally efficient, label efficient, and provides accuracies competitive with state-of-the-art neural networks for many types of classification problems. In addition to test accuracy, we show improvements (with respect to neural network-based methods) in terms of computational efficiency (it can be implemented without the use of GPUs), number of training samples needed for training, as well as out-of-distribution generalization. The Python code for reproducing our results is available at https://github.com/rohdelab/rcdt_ns_classifier.
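
A nearest-subspace classifier of the kind mentioned above can be sketched in a few lines. This is a generic illustration on plain feature vectors, not the authors' code: it assumes the R-CDT has already been applied to each image (that step is omitted), and the helper names `orthonormal_basis`, `residual`, `fit`, and `predict` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def residual(x, basis):
    """Squared distance from x to the subspace spanned by `basis`."""
    r = list(x)
    for b in basis:
        c = dot(x, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return dot(r, r)


def fit(train):
    """Build one subspace per class from its training vectors."""
    return {label: orthonormal_basis(vs) for label, vs in train.items()}


def predict(model, x):
    """Assign x to the class whose subspace is nearest."""
    return min(model, key=lambda label: residual(x, model[label]))


# Toy data standing in for transformed images (assumed, for illustration).
train = {
    "a": [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],  # spans the xy-plane
    "b": [[0.0, 0.0, 1.0]],                   # spans the z-axis
}
model = fit(train)
label_1 = predict(model, [2.0, -3.0, 0.1])
label_2 = predict(model, [0.1, 0.0, 5.0])
```

Fitting here is a single pass of Gram-Schmidt per class with nothing to tune, which is consistent with the non-iterative, hyper-parameter-free character the abstract describes.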

LG · Feb 12, 2025
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa et al.

As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.

LG · Mar 5, 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Richard Ren, Arunim Agarwal, Mantas Mazeika et al.

As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.

LG · Oct 13, 2025
Joint Discriminative-Generative Modeling via Dual Adversarial Training

Xuwang Yin, Claire Zhang, Julie Steele et al.

Simultaneously achieving robust classification and high-fidelity generative modeling within a single framework presents a significant challenge. Hybrid approaches, such as Joint Energy-Based Models (JEM), interpret classifiers as EBMs but are often limited by the instability and poor sample quality inherent in SGLD-based training. We address these limitations by proposing a novel training framework that integrates adversarial training (AT) principles for both discriminative robustness and stable generative learning. The proposed method introduces three key innovations: (1) the replacement of SGLD-based JEM learning with a stable, AT-based approach that optimizes the energy function by discriminating between real data and PGD-generated contrastive samples using the BCE loss; (2) synergistic adversarial training for the discriminative component that enhances classification robustness while eliminating the need for explicit gradient penalties; and (3) a two-stage training procedure to resolve the incompatibility between batch normalization and EBM training. Experiments on CIFAR-10, CIFAR-100, and ImageNet demonstrate that our method substantially improves adversarial robustness over existing hybrid models while maintaining competitive generative performance. On ImageNet, when optimized for generative modeling, our model's generative fidelity surpasses that of BigGAN and approaches diffusion models, representing the first MCMC-based EBM approach to achieve high-quality generation on complex, high-resolution datasets. Our approach addresses key stability issues that have limited JEM scaling and demonstrates that adversarial training can serve as an effective foundation for unified frameworks capable of generating and robustly classifying visual data.

CL · Nov 12, 2023
Learning Globally Optimized Language Structure via Adversarial Training

Xuwang Yin

Recent work has explored integrating autoregressive language models with energy-based models (EBMs) to enhance text generation capabilities. However, learning effective EBMs for text is challenged by the discrete nature of language. This work proposes an adversarial training strategy to address limitations in prior efforts. Specifically, an iterative adversarial attack algorithm is presented to generate negative samples for training the EBM by perturbing text from the autoregressive model. This aims to enable the EBM to suppress spurious modes outside the support of the data distribution. Experiments on an arithmetic sequence generation task demonstrate that the proposed adversarial training approach can substantially enhance the quality of generated sequences compared to prior methods. The results highlight the promise of adversarial techniques to improve discrete EBM training. Key contributions include: (1) an adversarial attack strategy tailored to text to generate negative samples, circumventing MCMC limitations; (2) an adversarial training algorithm for EBMs leveraging these attacks; (3) empirical validation of performance improvements on a sequence generation task.

CV · Feb 22, 2022
Local Sliced-Wasserstein Feature Sets for Illumination-invariant Face Recognition

Yan Zhuang, Shiying Li, Mohammad Shifat-E-Rabbi et al.

We present a new method for face recognition from digital images acquired under varying illumination conditions. The method is based on mathematical modeling of local gradient distributions using the Radon Cumulative Distribution Transform (R-CDT). We demonstrate that lighting variations cause certain types of deformations of local image gradient distributions which, when expressed in the R-CDT domain, can be modeled as a subspace. Face recognition is then performed using a nearest subspace method in the R-CDT domain of local gradient distributions. Experimental results demonstrate that the proposed method outperforms other alternatives in several face recognition tasks with challenging illumination conditions. Python code implementing the proposed method is available and is integrated as part of the software package PyTransKit.

LG · Dec 11, 2020
Learning Energy-Based Models With Adversarial Training

Xuwang Yin, Shiying Li, Gustavo K. Rohde

We study a new approach to learning energy-based models (EBMs) based on adversarial training (AT). We show that (binary) AT learns a special kind of energy function that models the support of the data distribution, and the learning process is closely related to MCMC-based maximum likelihood learning of EBMs. We further propose improved techniques for generative modeling with AT, and demonstrate that this new approach is capable of generating diverse and realistic images. Aside from having image generation performance competitive with explicit EBMs, the studied approach is stable to train, is well-suited for image translation tasks, and exhibits strong out-of-distribution adversarial robustness. Our results demonstrate the viability of the AT approach to generative modeling, suggesting that AT is a competitive alternative approach to learning EBMs.

LG · Aug 21, 2019
Testing Robustness Against Unforeseen Adversaries

Max Kaufmann, Daniel Kang, Yi Sun et al.

Adversarial robustness research primarily focuses on L_p perturbations, and most defenses are developed with identical training-time and test-time adversaries. However, in real-world applications developers are unlikely to have access to the full range of attacks or corruptions their system will face. Furthermore, worst-case inputs are likely to be diverse and need not be constrained to the L_p ball. To narrow in on this discrepancy between research and reality, we introduce ImageNet-UA, a framework for evaluating model robustness against a range of unforeseen adversaries, including eighteen new non-L_p attacks. To perform well on ImageNet-UA, defenses must overcome a generalization gap and be robust to diverse attacks not encountered during training. In extensive experiments, we find that existing robustness measures do not capture unforeseen robustness, that standard robustness techniques are beaten by alternative training strategies, and that novel methods can improve unforeseen robustness. We present ImageNet-UA as a useful tool for the community for improving the worst-case behavior of machine learning systems.

ML · Jul 4, 2019
Neural Networks, Hypersurfaces, and Radon Transforms

Soheil Kolouri, Xuwang Yin, Gustavo K. Rohde

Connections between integration along hypersurfaces, Radon transforms, and neural networks are exploited to highlight an integral geometric mathematical interpretation of neural networks. By analyzing the properties of neural networks as operators on probability distributions for observed data, we show that the distribution of outputs for any node in a neural network can be interpreted as a nonlinear projection along hypersurfaces defined by level surfaces over the input data space. We utilize these descriptions to provide new interpretations for phenomena such as nonlinearity, pooling, activation functions, and adversarial examples in neural network-based learning problems.

LG · May 27, 2019
GAT: Generative Adversarial Training for Adversarial Example Detection and Robust Classification

Xuwang Yin, Soheil Kolouri, Gustavo K. Rohde

The vulnerabilities of deep neural networks against adversarial examples have become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks has proven to be challenging, and the methods relying on detecting adversarial samples are only valid when the attacker is oblivious to the detection mechanism. In this paper we propose a principled adversarial example detection method that can withstand norm-constrained white-box attacks. Inspired by one-versus-the-rest classification, in a K-class classification problem, we train K binary classifiers where the i-th binary classifier is used to distinguish between clean data of class i and adversarially perturbed samples of other classes. At test time, we first use a trained classifier to get the predicted label (say k) of the input, and then use the k-th binary classifier to determine whether the input is a clean sample (of class k) or an adversarially perturbed example (of other classes). We further devise a generative approach to detecting/classifying adversarial examples by interpreting each binary classifier as an unnormalized density model of the class-conditional data. We provide a comprehensive evaluation of the above adversarial example detection/classification methods, and demonstrate their competitive performance and compelling properties.
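
The two-step test-time procedure described above (predict a label k, then consult the k-th binary classifier) can be sketched as follows. The base classifier and the per-class detectors here are hypothetical one-dimensional stand-ins rather than trained networks, and the names `make_detector`, `classify_and_detect`, and the `threshold` parameter are assumptions made for illustration.

```python
def make_detector(center, radius=0.5):
    """Hypothetical stand-in for the trained binary classifier of one class.

    Returns a score that is positive near the class center ("clean")
    and negative far from it ("adversarially perturbed").
    """
    def score(x):
        return radius - abs(x - center)
    return score


def base_classifier(x):
    """Hypothetical stand-in for the trained K-class classifier (K = 2)."""
    return 1 if x >= 0 else 0


def classify_and_detect(x, base_classifier, detectors, threshold=0.0):
    """Return (predicted label k, whether the k-th detector deems x clean)."""
    k = base_classifier(x)       # step 1: predicted label
    score = detectors[k](x)      # step 2: query only the k-th detector
    return k, score >= threshold


# Class 0 is centered at -1.0 and class 1 at +1.0 in this toy setup.
detectors = [make_detector(-1.0), make_detector(1.0)]

clean_result = classify_and_detect(0.9, base_classifier, detectors)
suspicious_result = classify_and_detect(0.05, base_classifier, detectors)
```

In this toy setup, an input near the decision boundary is still assigned a label by the base classifier but is flagged by the detector for that label, which is the behavior the one-versus-the-rest construction aims for.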

CL · Dec 10, 2018
Chat-crowd: A Dialog-based Platform for Visual Layout Composition

Paola Cascante-Bonilla, Xuwang Yin, Vicente Ordonez et al.

In this paper we introduce Chat-crowd, an interactive environment for visual layout composition via conversational interactions. Chat-crowd supports multiple agents with two conversational roles: agents who play the role of a designer are in charge of placing objects in an editable canvas according to instructions or commands issued by agents with a director role. The system can be integrated with crowdsourcing platforms for both synchronous and asynchronous data collection and is equipped with comprehensive quality controls on the performance of both types of agents. We expect that this system will be useful to build multimodal goal-oriented dialog tasks that require spatial and geometric reasoning.

CR · Dec 7, 2018
Privacy Partitioning: Protecting User Data During the Deep Learning Inference Phase

Jianfeng Chi, Emmanuel Owusu, Xuwang Yin et al.

We present a practical method for protecting data during the inference phase of deep learning based on bipartite topology threat modeling and an interactive adversarial deep network construction. We term this approach Privacy Partitioning. In the proposed framework, we split the machine learning models and deploy a few layers into users' local devices, and the rest of the layers into a remote server. We propose an approach to protect users' data during the inference phase, while still achieving good classification accuracy. We conduct an experimental evaluation of this approach on benchmark datasets of three computer vision tasks. The experimental results indicate that this approach can be used to significantly attenuate the capacity for an adversary with access to the state-of-the-art deep network's intermediate states to learn privacy-sensitive inputs to the network. For example, we demonstrate that our approach can prevent attackers from inferring the private attributes such as gender from the Face image dataset without sacrificing the classification accuracy of the original machine learning task such as Face Identification.
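
The deployment split described above can be illustrated with a minimal sketch in which a model is a list of layer functions, the first few run on the user's device, and the rest run on a server. It shows only the partitioning of inference, not the adversarial training the paper uses to make the intermediate state uninformative; the toy layers and the names `run` and `partition` are assumptions.

```python
def run(layers, x):
    """Apply a sequence of layer functions to an input."""
    for layer in layers:
        x = layer(x)
    return x


def partition(layers, split):
    """Split a model into (local layers, remote layers) at index `split`."""
    return layers[:split], layers[split:]


# Toy three-layer "network" (hypothetical layers for illustration).
layers = [
    lambda v: [2.0 * a for a in v],   # layer 1: scale
    lambda v: [a + 1.0 for a in v],   # layer 2: shift
    lambda v: sum(v),                 # layer 3: aggregate to a score
]

local_layers, remote_layers = partition(layers, split=1)

x = [1.0, 2.0]                                 # raw input stays on the device
intermediate = run(local_layers, x)            # computed on the device
prediction = run(remote_layers, intermediate)  # computed on the server
```

Only `intermediate` would cross the network in this arrangement, so the privacy question becomes how much an adversary can recover about `x` from that intermediate state.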

CV · Jul 22, 2017
OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts

Xuwang Yin, Vicente Ordonez

Generating captions for images is a task that has recently received considerable attention. In this work we focus on caption generation for abstract scenes, or object layouts where the only information provided is a set of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence model that encodes a set of objects and their locations as an input sequence using an LSTM network, and decodes this representation using an LSTM language model. We show that our model, despite encoding object layouts as a sequence, can represent spatial relationships between objects, and generate descriptions that are globally coherent and semantically relevant. We test our approach in a task of object-layout captioning by using only object annotations as inputs. We additionally show that our model, combined with a state-of-the-art object detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score) in the test benchmark of the standard MS-COCO Captioning task.

CV · Jan 11, 2013
Robust Text Detection in Natural Scene Images

Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang et al.

Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks. In this paper, we propose an accurate and robust method for detecting texts in natural scene images. A fast and effective pruning algorithm is designed to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations. Character candidates are grouped into text candidates by the single-link clustering algorithm, where distance weights and threshold of the clustering algorithm are learned automatically by a novel self-training distance metric learning algorithm. The posterior probabilities of text candidates corresponding to non-text are estimated with a character classifier; text candidates with high probabilities are then eliminated and finally texts are identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition dataset; the f-measure is over 76% and is significantly better than the state-of-the-art performance of 71%. Experimental results on a publicly available multilingual dataset also show that our proposed method can outperform the other competitive method with an f-measure increase of over 9 percent. Finally, we have set up an online demo of our proposed scene text detection system at http://kems.ustb.edu.cn/learning/yin/dtext.
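
The grouping step above relies on single-link clustering with a distance threshold, which can be sketched as follows. This illustration uses plain Euclidean distance and a fixed threshold, whereas the paper learns the distance weights and threshold; the function name and the toy points are assumptions.

```python
def single_link_clusters(points, threshold):
    """Group points so that any two points closer than `threshold` share a
    cluster, directly or through a chain of close neighbors (single link)."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            if d <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy character-candidate centers along a line (made-up coordinates).
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
clusters = single_link_clusters(points, threshold=1.5)
```

Points 0 and 2 are farther apart than the threshold but still end up together because point 1 links them, which is what lets single-link clustering chain adjacent characters into a text line.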