Chaoyi Zhu

LG
h-index: 6
3 papers
170 citations
Novelty: 68%
AI Score: 36

3 Papers

LG · Mar 12, 2024
Duwak: Dual Watermarks in Large Language Models

Chaoyi Zhu, Jeroen Galjaard, Pin-Yu Chen et al.

As large language models (LLMs) are increasingly used for text generation tasks, it is critical to audit their usage, govern their applications, and mitigate their potential harms. Existing watermark techniques have been shown to be effective at embedding a single human-imperceptible and machine-detectable pattern without significantly affecting generated text quality and semantics. However, the efficiency in detecting watermarks, i.e., the minimum number of tokens required to assert detection with significance and robustness against post-editing, is still debatable. In this paper, we propose Duwak to fundamentally enhance the efficiency and quality of watermarking by embedding dual secret patterns in both the token probability distribution and the sampling scheme. To mitigate expression degradation caused by biasing toward certain tokens, we design a contrastive search to watermark the sampling scheme, which minimizes token repetition and enhances diversity. We theoretically explain the interdependency of the two watermarks within Duwak. We evaluate Duwak extensively on Llama2 under various post-editing attacks, against four state-of-the-art watermarking techniques and combinations of them. Our results show that Duwak-marked text achieves the highest watermarked text quality at the lowest required token count for detection, up to 70% fewer tokens than existing approaches, especially under post-paraphrasing.
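The notion of "minimum number of tokens required to assert detection" can be illustrated with a generic green-list watermark detector. The sketch below is not Duwak's algorithm: it assumes a simple keyed hash that marks half of the vocabulary "green" at each step, and a one-proportion z-test on the green-token count. The names `is_green`, `detection_z_score`, `all_green_sequence`, the key string, and `GAMMA` are all hypothetical.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary marked "green" at each step


def is_green(prev_token: int, token: int, key: str = "secret") -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by key and previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def detection_z_score(tokens: list[int], key: str = "secret") -> float:
    """z-score of the green-token count against the unwatermarked expectation GAMMA * n."""
    n = len(tokens) - 1  # number of (previous, current) token pairs
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def all_green_sequence(length: int, vocab_size: int = 1000, key: str = "secret") -> list[int]:
    """Toy 'generator' that always emits the first green token (an idealized, fully biased sampler)."""
    tokens = [0]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(vocab_size) if is_green(prev, t, key)))
    return tokens


if __name__ == "__main__":
    marked = all_green_sequence(51)
    print(detection_z_score(marked))
```

With `GAMMA = 0.5` and every token green, the z-score equals the square root of the number of token pairs, so 50 pairs give roughly 7.07 and a threshold of z = 4 is first reached at 16 pairs. A weaker bias or post-editing lowers the green fraction and raises that minimum, which is the efficiency trade-off the abstract refers to.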

LG · Mar 12, 2024
DP-TLDM: Differentially Private Tabular Latent Diffusion Model

Chaoyi Zhu, Jiayi Tang, Juan F. Pérez et al.

Synthetic data from generative models is emerging as a privacy-preserving data-sharing solution. Such a synthetic data set should resemble the original data without revealing identifiable private information. To date, prior work has focused on limited types of tabular synthesizers and a small number of privacy attacks, particularly on Generative Adversarial Networks, and has overlooked membership inference attacks and defense strategies, i.e., differential privacy. Motivated by the conundrum of keeping high data quality and low privacy risk in synthetic data tables, we propose DP-TLDM, the Differentially Private Tabular Latent Diffusion Model, which is composed of an autoencoder network to encode the tabular data and a latent diffusion model to synthesize the latent tables. Following the emerging f-DP framework, we apply DP-SGD to train the autoencoder in combination with batch clipping and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM is capable of achieving a meaningful theoretical privacy guarantee while also significantly enhancing the utility of synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves the synthetic quality by an average of 35% in data resemblance, 15% in the utility for downstream tasks, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
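For readers unfamiliar with DP-SGD, the core update can be sketched in a few lines. This is the textbook per-example variant (clip each example's gradient, sum, add Gaussian noise, average), not the batch-clipping variant the abstract mentions, and it omits privacy accounting entirely. The function names `clip` and `dp_sgd_gradient` and all parameter names are hypothetical.

```python
import math
import random


def clip(grad: list[float], max_norm: float) -> list[float]:
    """Scale `grad` down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(
    per_example_grads: list[list[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> list[float]:
    """Return the noisy average gradient used for one DP-SGD step."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Gaussian noise with std = noise_multiplier * max_norm, added to the clipped sum
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
    print(dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=random.Random(0)))
```

Clipping bounds how much any single record can move the update, and the noise scale is tied to that bound. Together these are what make a formal privacy guarantee possible.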

MTRL-SCI · Feb 10, 2019
Paradigm shift in electron-based crystallography via machine learning

Kevin Kaufmann, Chaoyi Zhu, Alexander S. Rosengarten et al.

Accurately determining the crystallographic structure of a material, organic or inorganic, is a critical primary step in material development and analysis. The most common practices involve analysis of diffraction patterns produced in laboratory XRD, TEM, and synchrotron X-ray sources. However, these techniques are slow, require careful sample preparation, can be difficult to access, and are prone to human error during analysis. This paper presents a newly developed methodology that represents a paradigm change in electron diffraction-based structure analysis techniques, with the potential to revolutionize multiple crystallography-related fields. A machine learning-based approach for rapid and autonomous identification of the crystal structure of metals and alloys, ceramics, and geological specimens, without any prior knowledge of the sample, is presented and demonstrated utilizing the electron backscatter diffraction (EBSD) technique. Electron backscatter diffraction patterns are collected from materials with well-known crystal structures, then a deep neural network model is constructed for classification to a specific Bravais lattice or point group. The applicability of this approach is evaluated on diffraction patterns from samples unknown to the computer without any human input or data filtering. This is in contrast to traditional Hough transform EBSD, which requires that the phases present in the sample have already been determined. The internal operations of the neural network are elucidated through visualizing the symmetry features learned by the convolutional neural network. It is determined that the model looks for the same features a crystallographer would use, even though it is not explicitly programmed to do so. This study opens the door to fully automated, high-throughput determination of crystal structures via several electron-based diffraction techniques.