CR, AI | May 6, 2025

MergeGuard: Efficient Thwarting of Trojan Attacks in Machine Learning Models

arXiv:2505.04015v1
h-index: 68
Originality: Incremental advance
AI Analysis

This paper addresses security threats facing users of machine learning models trained by untrusted third parties. It represents an incremental improvement over existing fine-tuning-based mitigation methods.

The paper tackles Trojan attacks on AI models by proposing MergeGuard, a post-training method that linearizes and merges fully connected layers. In evaluations on Transformer models, the method reduces the Trojan attack success rate while maintaining model accuracy.

This paper proposes MergeGuard, a novel methodology for mitigating AI Trojan attacks. Trojan attacks on AI models cause inputs embedded with triggers to be misclassified to an adversary's target class, posing a significant threat to the usability of models trained by an untrusted third party. The core of MergeGuard is a new post-training methodology for linearizing and merging fully connected layers, which we show simultaneously improves model generalizability and performance. Our proof-of-concept evaluation on Transformer models demonstrates that MergeGuard maintains model accuracy while decreasing the Trojan attack success rate, outperforming commonly used post-training Trojan mitigation methodologies based on fine-tuning.
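The abstract does not spell out the merging procedure, so the following is only a minimal sketch of the algebra that makes merging possible, not the paper's implementation. It assumes the nonlinearity between two fully connected layers has already been replaced by the identity (the "linearized" state), in which case W2(W1 x + b1) + b2 collapses to a single layer with weight W2 W1 and bias W2 b1 + b2. All names and the toy weights are hypothetical, and how MergeGuard performs the linearization itself is not covered here.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Return the matrix product a @ b."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Return the matrix-vector product a @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in a]


def linear(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Apply one fully connected layer with no activation: w @ x + b."""
    return [y + bias for y, bias in zip(matvec(w, x), b)]


def merge_linear(w1: Matrix, b1: Vector,
                 w2: Matrix, b2: Vector) -> Tuple[Matrix, Vector]:
    """Fold y = w2 @ (w1 @ x + b1) + b2 into a single layer y = w @ x + b."""
    w = matmul(w2, w1)
    b = [y + bias for y, bias in zip(matvec(w2, b1), b2)]
    return w, b


# Hypothetical toy weights for a 3 -> 4 -> 2 stack (assumption, not from the paper).
W1 = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0], [2.0, 1.0, -1.0]]
B1 = [0.5, -0.5, 1.0, 0.0]
W2 = [[1.0, 2.0, 0.0, -1.0], [0.0, 1.0, 1.0, 0.5]]
B2 = [0.25, -0.75]

W, B = merge_linear(W1, B1, W2, B2)

x = [1.0, 2.0, 3.0]
two_layer = linear(W2, B2, linear(W1, B1, x))  # original two-layer path
merged = linear(W, B, x)                       # single merged layer
```

In this sketch the merged layer is a 2x3 matrix that reproduces the two-layer output exactly, while the 4-unit hidden layer disappears from the model.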

Code Implementations: 1 repo