Zilong Zhao

LG
h-index44
29papers
797citations
Novelty49%
AI Score50

29 Papers

LGOct 19, 2023Code
TabuLa: Harnessing Language Models for Tabular Data Synthesis

Zilong Zhao, Robert Birke, Lydia Chen

Tabular data synthesis is crucial for addressing privacy and security concerns in industries reliant on tabular data. While recent advancements adopt large language models (LLMs) for realistic tabular data generation, their long training times and limited reusability hinder practical applications. In this paper, we propose Tabula, a tabular data synthesizer that leverages the structure of LLM. Unlike state-of-the-art (SOTA) LLM-based tabular data synthesizers that rely on pre-trained LLMs, Tabula discards the pre-trained weights originally designed for natural language tasks, focusing instead on a tailored approach for tabular data. In addition, Tabula introduces a token sequence compression strategy that significantly reduces training time while maintaining data quality, alongside a novel token padding method that improves sequence alignment across training batches. Experiments on six datasets show that Tabula achieves superior synthetic data utility compared to current SOTA methods. Additionally, the results demonstrate that Tabula model trained on tabular datasets serves effectively as a foundational model for synthesizing new tabular datasets. Furthermore, the proposed padding method outperforms the conventional left and right padding strategies. Finally, the results highlight that Tabula averagely reduces training time per epoch by 46.2% compared to state-of-the-art LLM approaches while achieving higher data utility. Our code is available at https://github.com/zhao-zilong/Tabula

LGApr 1, 2022
CTAB-GAN+: Enhancing Tabular Data Synthesis

Zilong Zhao, Aditya Kunar, Robert Birke et al.

While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., European General Data Protection Regulation (GDPR)) limit its full effectiveness. Synthetic tabular data emerges as alternative to enable data sharing while fulfilling regulatory and privacy constraints. State-of-the-art tabular data synthesizers draw methodologies from Generative Adversarial Networks (GAN). As GANs improve the synthesized data increasingly resemble the real data risking to leak privacy. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains yet a challenging research question. We propose CTAB-GAN+ a novel conditional tabular GAN. CTAB-GAN+ improves upon state-of-the-art by (i) adding downstream losses to conditional GANs for higher utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on data similarity and analysis utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 48.16% higher utility across multiple datasets and learning tasks under different privacy budgets.

CLMay 21, 2025
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Tencent Hunyuan Team, Ao Liu, Botong Zhou et al. · tencent-ai

As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.

LGFeb 3, 2023
VT-GAN: Cooperative Tabular Data Synthesis using Vertical Federated Learning

Zilong Zhao, Han Wu, Aad Van Moorsel et al.

This paper presents the application of Vertical Federated Learning (VFL) to generate synthetic tabular data using Generative Adversarial Networks (GANs). VFL is a collaborative approach to train machine learning models among distinct tabular data holders, such as financial institutions, who possess disjoint features for the same group of customers. In this paper we introduce the VT-GAN framework, Vertical federated Tabular GAN, and demonstrate that VFL can be successfully used to implement GANs for distributed tabular data in privacy-preserving manner, with performance close to centralized GANs that assume shared data. We make design choices with respect to the distribution of GAN generator and discriminator models and introduce a training-with-shuffling technique so that no party can reconstruct training data from the GAN conditional vector. The paper presents (1) an implementation of VT-GAN, (2) a detailed quality evaluation of the VT-GAN-generated synthetic data, (3) an overall scalability examination of VT-GAN framework, (4) a security analysis on VT-GAN's robustness against Membership Inference Attack with different settings of Differential Privacy, for a range of datasets with diverse distribution characteristics. Our results demonstrate that VT-GAN can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients or with different numbers of clients.

LGOct 12, 2022
FCT-GAN: Enhancing Table Synthesis via Fourier Transform

Zilong Zhao, Robert Birke, Lydia Y. Chen

Synthetic tabular data emerges as an alternative for sharing knowledge while adhering to restrictive data access regulations, e.g., European General Data Protection Regulation (GDPR). Mainstream state-of-the-art tabular data synthesizers draw methodologies from Generative Adversarial Networks (GANs), which are composed of a generator and a discriminator. While convolution neural networks are shown to be a better architecture than fully connected networks for tabular data synthesizing, two key properties of tabular data are overlooked: (i) the global correlation across columns, and (ii) invariant synthesizing to column permutations of input data. To address the above problems, we propose a Fourier conditional tabular generative adversarial network (FCT-GAN). We introduce feature tokenization and Fourier networks to construct a transformer-style generator and discriminator, and capture both local and global dependencies across columns. The tokenizer captures local spatial features and transforms original data into tokens. Fourier networks transform tokens to frequency domains and element-wisely multiply a learnable filter. Extensive evaluation on benchmarks and real-world data shows that FCT-GAN can synthesize tabular data with high machine learning utility (up to 27.8% better than state-of-the-art baselines) and high statistical similarity to the original data (up to 26.5% better), while maintaining the global correlation across columns, especially on high dimensional dataset.

LGNov 17, 2022
Permutation-Invariant Tabular Data Synthesis

Yujin Zhu, Zilong Zhao, Robert Birke et al.

Tabular data synthesis is an emerging approach to circumvent strict regulations on data privacy while discovering knowledge through big data. Although state-of-the-art AI-based tabular data synthesizers, e.g., table-GAN, CTGAN, TVAE, and CTAB-GAN, are effective at generating synthetic tabular data, their training is sensitive to column permutations of input data. In this paper, we first conduct an extensive empirical study to disclose such a property of permutation invariance and an in-depth analysis of the existing synthesizers. We show that changing the input column order worsens the statistical difference between real and synthetic data by up to 38.67% due to the encoding of tabular data and the network architectures. To fully unleash the potential of big synthetic tabular data, we propose two solutions: (i) AE-GAN, a synthesizer that uses an autoencoder network to represent the tabular data and GAN networks to synthesize the latent representation, and (ii) a feature sorting algorithm to find the suitable column order of input data for CNN-based synthesizers. We evaluate the proposed solutions on five datasets in terms of the sensitivity to the column permutation, the quality of synthetic data, and the utility in downstream analyses. Our results show that we enhance the property of permutation-invariance when training synthesizers and further improve the quality and utility of synthetic data, up to 22%, compared to the existing synthesizers.

LGSep 5, 2024
VFLGAN-TS: Vertical Federated Learning-based Generative Adversarial Networks for Publication of Vertically Partitioned Time-Series Data

Xun Yuan, Zilong Zhao, Prosanta Gope et al.

In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, often original data cannot be shared due to privacy concerns and regulations. A potential solution is to release a synthetic dataset with a similar distribution to the private dataset. Nevertheless, in some scenarios, the attributes required to train an AI model are distributed among different parties, and the parties cannot share the local data for synthetic data construction due to privacy regulations. In PETS 2024, we recently introduced the first Vertical Federated Learning-based Generative Adversarial Network (VFLGAN) for publishing vertically partitioned static data. However, VFLGAN cannot effectively handle time-series data, presenting both temporal and attribute dimensions. In this article, we proposed VFLGAN-TS, which combines the ideas of attribute discriminator and vertical federated learning to generate synthetic time-series data in the vertically partitioned scenario. The performance of VFLGAN-TS is close to that of its counterpart, which is trained in a centralized manner and represents the upper limit for VFLGAN-TS. To further protect privacy, we apply a Gaussian mechanism to make VFLGAN-TS satisfy an $(ε,δ)$-differential privacy. Besides, we develop an enhanced privacy auditing scheme to evaluate the potential privacy breach through the framework of VFLGAN-TS and synthetic datasets.

LGDec 3, 2025
Towards Irreversible Machine Unlearning for Diffusion Models

Xun Yuan, Zilong Zhao, Jiayu Li et al.

Diffusion models are renowned for their state-of-the-art performance in generating synthetic images. However, concerns related to safety, privacy, and copyright highlight the need for machine unlearning, which can make diffusion models forget specific training data and prevent the generation of sensitive or unwanted content. Current machine unlearning methods for diffusion models are primarily designed for conditional diffusion models and focus on unlearning specific data classes or features. Among these methods, finetuning-based machine unlearning methods are recognized for their efficiency and effectiveness, which update the parameters of pre-trained diffusion models by minimizing carefully designed loss functions. However, in this paper, we propose a novel attack named Diffusion Model Relearning Attack (DiMRA), which can reverse the finetuning-based machine unlearning methods, posing a significant vulnerability of this kind of technique. Without prior knowledge of the unlearning elements, DiMRA optimizes the unlearned diffusion model on an auxiliary dataset to reverse the unlearning, enabling the model to regenerate previously unlearned elements. To mitigate this vulnerability, we propose a novel machine unlearning method for diffusion models, termed as Diffusion Model Unlearning by Memorization (DiMUM). Unlike traditional methods that focus on forgetting, DiMUM memorizes alternative data or features to replace targeted unlearning data or features in order to prevent generating such elements. In our experiments, we demonstrate the effectiveness of DiMRA in reversing state-of-the-art finetuning-based machine unlearning methods for diffusion models, highlighting the need for more robust solutions. We extensively evaluate DiMUM, demonstrating its superior ability to preserve the generative performance of diffusion models while enhancing robustness against DiMRA.

CLNov 4, 2024Code
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Xingwu Sun, Yanfeng Chen, Yiqing Huang et al. · tencent-ai

In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practice of Hunyuan-Large include large-scale synthetic data that is orders larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidances for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Codes: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large

LGSep 25, 2024
CombU: A Combined Unit Activation for Fitting Mathematical Expressions with Neural Networks

Jiayu Li, Zilong Zhao, Kevin Yee et al.

The activation functions are fundamental to neural networks as they introduce non-linearity into data relationships, thereby enabling deep networks to approximate complex data relations. Existing efforts to enhance neural network performance have predominantly focused on developing new mathematical functions. However, we find that a well-designed combination of existing activation functions within a neural network can also achieve this objective. In this paper, we introduce the Combined Units activation (CombU), which employs different activation functions at various dimensions across different layers. This approach can be theoretically proven to fit most mathematical expressions accurately. The experiments conducted on four mathematical expression datasets, compared against six State-Of-The-Art (SOTA) activation function algorithms, demonstrate that CombU outperforms all SOTA algorithms in 10 out of 16 metrics and ranks in the top three for the remaining six metrics.

CVOct 11, 2022
UGformer for Robust Left Atrium and Scar Segmentation Across Scanners

Tianyi Liu, Size Hou, Jiayuan Zhu et al.

Thanks to the capacity for long-range dependencies and robustness to irregular shapes, vision transformers and deformable convolutions are emerging as powerful vision techniques of segmentation.Meanwhile, Graph Convolution Networks (GCN) optimize local features based on global topological relationship modeling. Particularly, they have been proved to be effective in addressing issues in medical imaging segmentation tasks including multi-domain generalization for low-quality images. In this paper, we present a novel, effective, and robust framework for medical image segmentation, namely, UGformer. It unifies novel transformer blocks, GCN bridges, and convolution decoders originating from U-Net to predict left atriums (LAs) and LA scars. We have identified two appealing findings of the proposed UGformer: 1). an enhanced transformer module with deformable convolutions to improve the blending of the transformer information with convolutional information and help predict irregular LAs and scar shapes. 2). Using a bridge incorporating GCN to further overcome the difficulty of capturing condition inconsistency across different Magnetic Resonance Images scanners with various inconsistent domain information. The proposed UGformer model exhibits outstanding ability to segment the left atrium and scar on the LAScarQS 2022 dataset, outperforming several recent state-of-the-arts.

CVMar 2Code
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Zilong Zhao, Zhengming Ding, Pei Niu et al.

Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.

AIFeb 24, 2024Code
Stepwise Self-Consistent Mathematical Reasoning with Large Language Models

Zilong Zhao, Yao Rong, Dongyang Guo et al.

Using Large Language Models for complex mathematical reasoning is difficult, primarily due to the complexity of multi-step reasoning. The main challenges of this process include (1) selecting critical intermediate results to advance the procedure, and (2) limited exploration of potential solutions. To address these issues, we introduce a novel algorithm, namely Stepwise Self-Consistent Chain-of-Thought (SSC-CoT). SSC-CoT employs a strategy of selecting intermediate steps based on the intersection of various reasoning chains. Additionally, SSC-CoT enables the model to discover critical intermediate steps by querying a knowledge graph comprising relevant domain knowledge. To validate SSC-CoT, we present a new dataset, TriMaster100, tailored for complex trigonometry problems. This dataset contains 100 questions, with each solution broken down into scored intermediate steps, facilitating a comprehensive evaluation of the mathematical reasoning process. On TriMaster100, SSC-CoT triples the effectiveness of the state-of-the-art methods. Furthermore, we benchmark SSC-CoT on the widely recognized complex mathematical question dataset, MATH level 5, and it surpasses the second-best method by 7.2% in accuracy. Code and the TriMaster100 dataset can be found at: https://github.com/zhao-zilong/ssc-cot.

CVNov 28, 2025Code
Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day

Milad Abdollahzadeh, Abdul Raheem, Zilong Zhao et al.

Tabular instruction tuning has emerged as a promising research direction for improving LLMs understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.

DBDec 23, 2023Code
IRG: Generating Synthetic Relational Databases using Deep Learning with Insightful Relational Understanding

Jiayu Li, Zilong Zhao, Vikram Chundawat et al.

Synthetic data has numerous applications, including but not limited to software testing at scale, privacy-preserving data sharing to enable smoother collaboration between stakeholders, and data augmentation for analytical and machine learning tasks. Relational databases, which are commonly used by corporations, governments, and financial institutions, present unique challenges for synthetic data generation due to their complex structures. Existing synthetic relational database generation approaches often assume idealized scenarios, such as every table having a perfect primary key column without composite and potentially overlapping primary or foreign key constraints, and fail to account for the sequential nature of certain tables. In this paper, we propose incremental relational generator (IRG), that successfully handles these ubiquitous real-life situations. IRG ensures the preservation of relational schema integrity, offers a deep contextual understanding of relationships beyond direct ancestors and descendants, leverages the power of newly designed deep neural networks, and scales efficiently to handle larger datasets--a combination never achieved in previous works. Experiments on three open-source real-life relational datasets in different fields at different scales demonstrate IRG's advantage in maintaining the synthetic data's relational schema validity and data fidelity and utility.

LGAug 18, 2021Code
Fed-TGAN: Federated Learning Framework for Synthesizing Tabular Data

Zilong Zhao, Robert Birke, Aditya Kunar et al.

Generative Adversarial Networks (GANs) are typically trained to synthesize data, from images and more recently tabular data, under the assumption of directly accessible training data. Recently, federated learning (FL) is an emerging paradigm that features decentralized learning on client's local data with a privacy-preserving capability. And, while learning GANs to synthesize images on FL systems has just been demonstrated, it is unknown if GANs for tabular data can be learned from decentralized data sources. Moreover, it remains unclear which distributed architecture suits them best. Different from image GANs, state-of-the-art tabular GANs require prior knowledge on the data distribution of each (discrete and continuous) column to agree on a common encoding -- risking privacy guarantees. In this paper, we propose Fed-TGAN, the first Federated learning framework for Tabular GANs. To effectively learn a complex tabular GAN on non-identical participants, Fed-TGAN designs two novel features: (i) a privacy-preserving multi-source feature encoding for model initialization; and (ii) table similarity aware weighting strategies to aggregate local models for countering data skew. We extensively evaluate the proposed Fed-TGAN against variants of decentralized learning architectures on four widely used datasets. Results show that Fed-TGAN accelerates training time per epoch up to 200% compared to the alternative architectures, for both IID and Non-IID data. Overall, Fed-TGAN not only stabilizes the training loss, but also achieves better similarity between generated and original data. Our code is released at https://github.com/zhao-zilong/Fed-TGAN.

CRMar 26, 2025
Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead

Viktor Schlegel, Anil A Bharath, Zilong Zhao et al.

Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for down-stream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints ($ε\leq 4$), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. %Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.

LGJan 2, 2025
TabTreeFormer: Tabular Data Generation Using Hybrid Tree-Transformer

Jiayu Li, Bingyin Zhao, Zilong Zhao et al.

Transformers have shown impressive results in tabular data generation. However, they lack domain-specific inductive biases which are critical for preserving the intrinsic characteristics of tabular data. They also suffer from poor scalability and efficiency due to quadratic computational complexity. In this paper, we propose TabTreeFormer, a hybrid transformer architecture that integrates inductive biases of tree-based models (i.e., non-smoothness and non-rotational invariance) to effectively handle the discrete and weakly correlated features in tabular datasets. To improve numerical fidelity and capture multimodal distributions, we introduce a novel tokenizer that learns token sequences based on the complexity of tabular values. This reduces vocabulary size and sequence length, yielding more compact and efficient representations without sacrificing performance. We evaluate TabTreeFormer on nine diverse datasets, benchmarking against eight generative models. We show that TabTreeFormer consistently outperforms baselines in utility, fidelity, and privacy metrics with competitive efficiency. Notably, in scenarios prioritizing data utility over privacy and efficiency, the best variant of TabTreeFormer delivers a 44% performance gain relative to its baseline variant.

LGApr 27, 2024
TabVFL: Improving Latent Representation in Vertical Federated Learning

Mohamed Rashad, Zilong Zhao, Jeremie Decouchant et al.

Autoencoders are popular neural networks that are able to compress high dimensional data to extract relevant latent information. TabNet is a state-of-the-art neural network model designed for tabular data that utilizes an autoencoder architecture for training. Vertical Federated Learning (VFL) is an emerging distributed machine learning paradigm that allows multiple parties to train a model collaboratively on vertically partitioned data while maintaining data privacy. The existing design of training autoencoders in VFL is to train a separate autoencoder in each participant and aggregate the latent representation later. This design could potentially break important correlations between feature data of participating parties, as each autoencoder is trained on locally available features while disregarding the features of others. In addition, traditional autoencoders are not specifically designed for tabular data, which is ubiquitous in VFL settings. Moreover, the impact of client failures during training on the model robustness is under-researched in the VFL scene. In this paper, we propose TabVFL, a distributed framework designed to improve latent representation learning using the joint features of participants. The framework (i) preserves privacy by mitigating potential data leakage with the addition of a fully-connected layer, (ii) conserves feature correlations by learning one latent representation vector, and (iii) provides enhanced robustness against client failures during training phase. Extensive experiments on five classification datasets show that TabVFL can outperform the prior work design, with 26.12% of improvement on f1-score.

CRFeb 7, 2022
Fabricated Flips: Poisoning Federated Learning without Data

Jiyue Huang, Zilong Zhao, Lydia Y. Chen et al.

Attacks on Federated Learning (FL) can severely reduce the quality of the generated models and limit the usefulness of this emerging learning paradigm that enables on-premise decentralized learning. However, existing untargeted attacks are not practical for many scenarios as they assume that i) the attacker knows every update of benign clients, or ii) the attacker has a large dataset to locally train updates imitating benign parties. In this paper, we propose a data-free untargeted attack (DFA) that synthesizes malicious data to craft adversarial models without eavesdropping on the transmission of benign clients at all or requiring a large quantity of task-specific training data. We design two variants of DFA, namely DFA-R and DFA-G, which differ in how they trade off stealthiness and effectiveness. Specifically, DFA-R iteratively optimizes a malicious data layer to minimize the prediction confidence of all outputs of the global model, whereas DFA-G interactively trains a malicious data generator network by steering the output of the global model toward a particular class. Experimental results on Fashion-MNIST, Cifar-10, and SVHN show that DFA, despite requiring fewer assumptions than existing attacks, achieves similar or even higher attack success rate than state-of-the-art untargeted attacks against various state-of-the-art defense mechanisms. Concretely, they can evade all considered defense mechanisms in at least 50% of the cases for CIFAR-10 and often reduce the accuracy by more than a factor of 2. Consequently, we design REFD, a defense specifically crafted to protect against data-free attacks. REFD leverages a reference dataset to detect updates that are biased or have a low confidence. It greatly improves upon existing defenses by filtering out the malicious updates and achieves high global model accuracy

CVJan 24, 2022
Attacks and Defenses for Free-Riders in Multi-Discriminator GAN

Zilong Zhao, Jiyue Huang, Stefanie Roos et al.

Generative Adversarial Networks (GANs) are increasingly adopted by the industry to synthesize realistic images. Due to data not being centrally available, Multi-Discriminator (MD)-GANs training framework employs multiple discriminators that have direct access to the real data. Distributedly training a joint GAN model entails the risk of free-riders, i.e., participants that aim to benefit from the common model while only pretending to participate in the training process. In this paper, we conduct the first characterization study of the impact of free-riders on MD-GAN. Based on two production prototypes of MD-GAN, we find that free-riders drastically reduce the ability of MD-GANs to produce images that are indistinguishable from real data, i.e., they increase the FID score -- the standard measure to assess the quality of generated images. To mitigate the model degradation, we propose a defense strategy against free-riders in MD-GAN, termed DFG. DFG distinguishes free-riders and benign participants through periodic probing and clustering of discriminators' responses based on a reference response of free-riders, which then allows the generator to exclude the detected free-riders from the training. Furthermore, we extend our defense, termed DFG+, to enable discriminators to filter out free-riders at the variant of MD-GAN that allows peer exchanges of discriminators networks. Extensive evaluation on various scenarios of free-riders, MD-GAN architecture, and three datasets show that our defenses effectively detect free-riders. With 1 to 5 free-riders, DFG and DFG+ averagely decreases FID by 5.22% to 11.53% for CIFAR10 and 5.79% to 13.22% for CIFAR100 in comparison to an attack without defense. In a shell, the proposed DFG(+) can effectively defend against free-riders without affecting benign clients at a negligible computation overhead.

CVSep 19, 2021
ComicGAN: Text-to-Comic Generative Adversarial Network

Ben Proven-Bessel, Zilong Zhao, Lydia Chen

Drawing and annotating comic illustrations is a complex and difficult process. No existing machine learning algorithms have been developed to create comic illustrations based on descriptions of illustrations, or the dialogue in comics. Moreover, it is not known if a generative adversarial network (GAN) can generate original comics that correspond to the dialogue and/or descriptions. GANs are successful in producing photo-realistic images, but this technology does not necessarily translate to generation of flawless comics. What is more, comic evaluation is a prominent challenge as common metrics such as Inception Score will not perform comparably, as they are designed to work on photos. In this paper: 1. We implement ComicGAN, a novel text-to-comic pipeline based on a text-to-image GAN that synthesizes comics according to text descriptions. 2. We describe an in-depth empirical study of the technical difficulties of comic generation using GAN's. ComicGAN has two novel features: (i) text description creation from labels via permutation and augmentation, and (ii) custom image encoding with Convolutional Neural Networks. We extensively evaluate the proposed ComicGAN in two scenarios, namely image generation from descriptions, and image generation from dialogue. Our results on 1000 Dilbert comic panels and 6000 descriptions show synthetic comic panels from text inputs resemble original Dilbert panels. Novel methods for text description creation and custom image encoding brought improvements to Frechet Inception Distance, detail, and overall image quality over baseline algorithms. Generating illustrations from descriptions provided clear comics including characters and colours that were specified in the descriptions.

LGJul 6, 2021
DTGAN: Differential Private Training for Tabular GANs

Aditya Kunar, Robert Birke, Zilong Zhao et al.

Tabular generative adversarial networks (TGAN) have recently emerged to cater to the need of synthesizing tabular data -- the most widely used data format. While synthetic tabular data offers the advantage of complying with privacy regulations, there still exists a risk of privacy leakage via inference attacks due to interpolating the properties of real data during training. Differential private (DP) training algorithms provide theoretical guarantees for training machine learning models by injecting statistical noise to prevent privacy leaks. However, the challenges of applying DP on TGAN are to determine the most optimal framework (i.e., PATE/DP-SGD) and neural network (i.e., Generator/Discriminator)to inject noise such that the data utility is well maintained under a given privacy guarantee. In this paper, we propose DTGAN, a novel conditional Wasserstein tabular GAN that comes in two variants DTGAN_G and DTGAN_D, for providing a detailed comparison of tabular GANs trained using DP-SGD for the generator vs discriminator, respectively. We elicit the privacy analysis associated with training the generator with complex loss functions (i.e., classification and information losses) needed for high quality tabular data synthesis. Additionally, we rigorously evaluate the theoretical privacy guarantees offered by DP empirically against membership and attribute inference attacks. Our results on 3 datasets show that the DP-SGD framework is superior to PATE and that a DP discriminator is more optimal for training convergence. Thus, we find (i) DTGAN_D is capable of maintaining the highest data utility across 4 ML models by up to 18% in terms of the average precision score for a strict privacy budget, epsilon = 1, as compared to the prior studies and (ii) DP effectively prevents privacy loss against inference attacks by restricting the success probability of membership attacks to be close to 50%.

LGMar 19, 2021
Enhancing Robustness of On-line Learning Models on Highly Noisy Data

Zilong Zhao, Robert Birke, Rui Han et al.

Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper, we extend a two-layer on-line data selection framework: Robust Anomaly Detector (RAD) with a newly designed ensemble prediction where both layers contribute to the final anomaly detection decision. To adapt to the on-line nature of anomaly detection, we consider additional features of conflicting opinions of classifiers, repetitive cleaning, and oracle knowledge. We on-line learn from incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity from the larger accumulated data set. Moreover, we explore the concept of oracle learning that provides additional information of true labels for difficult data points. We specifically focus on three use cases, (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, and (iii) recognising 100 celebrities faces. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, to reach up to 98.95% for IoT device attacks (i.e., +7%), up to 85.03% for cloud task failures (i.e., +14%) under 40% label noise, and for its extension, it can reach up to 77.51% for face recognition (i.e., +39%) under 30% label noise. The proposed RAD and its extensions are general and can be applied to different anomaly detection algorithms.

LGFeb 16, 2021
CTAB-GAN: Effective Table Data Synthesizing

Zilong Zhao, Aditya Kunar, Hiek Van der Scheer et al.

While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., European General Data Protection Regulation (GDPR)) unfortunately limit its full effectiveness. Synthetic tabular data emerges as an alternative to enable data sharing while fulfilling regulatory and privacy constraints. The state-of-the-art tabular data synthesizers draw methodologies from generative Adversarial Networks (GAN) and address two main data types in the industry, i.e., continuous and categorical. In this paper, we develop CTAB-GAN, a novel conditional table GAN architecture that can effectively model diverse data types, including a mix of continuous and categorical variables. Moreover, we address data imbalance and long-tail issues, i.e., certain variables have drastic frequency differences across large values. To achieve those aims, we first introduce the information loss and classification loss to the conditional GAN. Secondly, we design a novel conditional vector, which efficiently encodes the mixed data type and skewed distribution of data variable. We extensively evaluate CTAB-GAN with the state of the art GANs that generate synthetic tables, in terms of data similarity and analysis utility. The results on five datasets show that the synthetic data of CTAB-GAN remarkably resembles the real data for all three types of variables and results into higher accuracy for five machine learning algorithms, by up to 17%.

LGMar 20, 2020
Event-Based Control for Online Training of Neural Networks

Zilong Zhao, Sophie Cerf, Bogdan Robu et al.

Convolutional Neural Network (CNN) has become the most used method for image classification tasks. During its training the learning rate and the gradient are two key factors to tune for influencing the convergence speed of the model. Usual learning rate strategies are time-based i.e. monotonous decay over time. Recent state-of-the-art techniques focus on adaptive gradient algorithms i.e. Adam and its versions. In this paper we consider an online learning scenario and we propose two Event-Based control loops to adjust the learning rate of a classical algorithm E (Exponential)/PD (Proportional Derivative)-Control. The first Event-Based control loop will be implemented to prevent sudden drop of the learning rate when the model is approaching the optimum. The second Event-Based control loop will decide, based on the learning speed, when to switch to the next data batch. Experimental evaluationis provided using two state-of-the-art machine learning image datasets (CIFAR-10 and CIFAR-100). Results show the Event-Based E/PD is better than the original algorithm (higher final accuracy, lower final loss value), and the Double-Event-BasedE/PD can accelerate the training process, save up to 67% training time compared to state-of-the-art algorithms and even result in better performance.

LGJan 28, 2020
QActor: On-line Active Learning for Noisy Labeled Stream Data

Taraneh Younesian, Zilong Zhao, Amirmasoud Ghiassi et al.

Noisy labeled data is more a norm than a rarity for self-generated content that is continuously published on the web and social media. Due to privacy concerns and governmental regulations, such a data stream can only be stored and used for learning purposes in a limited duration. To overcome the noise in this on-line scenario we propose QActor which novel combines: the selection of supposedly clean samples via quality models and actively querying an oracle for the most informative true labels. While the former can suffer from low data volumes of on-line scenarios, the latter is constrained by the availability and costs of human experts. QActor swiftly combines the merits of quality models for data filtering and oracle queries for cleaning the most informative data. The objective of QActor is to leverage the stringent oracle budget to robustly maximize the learning accuracy. QActor explores various strategies combining different query allocations and uncertainty measures. A central feature of QActor is to dynamically adjust the query limit according to the learning loss for each data batch. We extensively evaluate different image datasets fed into the classifier that can be standard machine learning (ML) models or deep neural networks (DNN) with noise label ratios ranging between 30% and 80%. Our results show that QActor can nearly match the optimal accuracy achieved using only clean data at the cost of at most an additional 6% of ground truth data from the oracle.

LGNov 18, 2019
Feedback Control for Online Training of Neural Networks

Zilong Zhao, Sophie Cerf, Bogdan Robu et al.

Convolutional neural networks (CNNs) are commonly used for image classification tasks, raising the challenge of their application on data flows. During their training, adaptation is often performed by tuning the learning rate. Usual learning rate strategies are time-based i.e. monotonously decreasing. In this paper, we advocate switching to a performance-based adaptation, in order to improve the learning efficiency. We present E (Exponential)/PD (Proportional Derivative)-Control, a conditional learning rate strategy that combines a feedback PD controller based on the CNN loss function, with an exponential control signal to smartly boost the learning and adapt the PD parameters. Stability proof is provided as well as an experimental evaluation using two state of the art image datasets (CIFAR-10 and Fashion-MNIST). Results show better performances than the related works (faster network accuracy growth reaching higher levels) and robustness of the E/PD-Control regarding its parametrization.

LGNov 11, 2019
RAD: On-line Anomaly Detection for Highly Unreliable Data

Zilong Zhao, Robert Birke, Rui Han et al.

Classification algorithms have been widely adopted to detect anomalies for various systems, e.g., IoT, cloud and face recognition, under the common assumption that the data source is clean, i.e., features and labels are correctly set. However, data collected from the wild can be unreliable due to careless annotations or malicious data transformation for incorrect anomaly detection. In this paper, we present a two-layer on-line learning framework for robust anomaly detection (RAD) in the presence of unreliable anomaly labels, where the first layer is to filter out the suspicious data, and the second layer detects the anomaly patterns from the remaining data. To adapt to the on-line nature of anomaly detection, we extend RAD with additional features of repetitively cleaning, conflicting opinions of classifiers, and oracle knowledge. We on-line learn from the incoming data streams and continuously cleanse the data, so as to adapt to the increasing learning capacity from the larger accumulated data set. Moreover, we explore the concept of oracle learning that provides additional information of true labels for difficult data points. We specifically focus on three use cases, (i) detecting 10 classes of IoT attacks, (ii) predicting 4 classes of task failures of big data jobs, (iii) recognising 20 celebrities faces. Our evaluation results show that RAD can robustly improve the accuracy of anomaly detection, to reach up to 98% for IoT device attacks (i.e., +11%), up to 84% for cloud task failures (i.e., +20%) under 40% noise, and up to 74% for face recognition (i.e., +28%) under 30% noisy labels. The proposed RAD is general and can be applied to different anomaly detection algorithms.