Eric Hsin

CLDec 18, 2025Code

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh et al.

This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-40 and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.

CVDec 14, 2019

Deep Poisoning: Towards Robust Image Data Sharing against Visual Disclosure

Hao Guo, Brian Dolhansky, Eric Hsin et al.

Due to respectively limited training data, different entities addressing the same vision task based on certain sensitive images may not train a robust deep network. This paper introduces a new vision task where various entities share task-specific image data to enlarge each other's training data volume without visually disclosing sensitive contents (e.g. illegal images). Then, we present a new structure-based training regime to enable different entities learn task-specific and reconstruction-proof image representations for image data sharing. Specifically, each entity learns a private Deep Poisoning Module (DPM) and insert it to a pre-trained deep network, which is designed to perform the specific vision task. The DPM deliberately poisons convolutional image features to prevent image reconstructions, while ensuring that the altered image data is functionally equivalent to the non-poisoned data for the specific vision task. Given this equivalence, the poisoned features shared from one entity could be used by another entity for further model refinement. Experimental results on image classification prove the efficacy of the proposed method.

Eric Hsin

2 Papers