LGMar 1, 2023Code
Combating Exacerbated Heterogeneity for Robust Models in Federated LearningJianing Zhu, Jiangchao Yao, Tongliang Liu et al. · tsinghua
Privacy and security concerns in real-world applications have led to the development of adversarially robust federated models. However, the straightforward combination between adversarial training and federated learning in one framework can lead to the undesired robustness deterioration. We discover that the attribution behind this phenomenon is that the generated adversarial data could exacerbate the data heterogeneity among local clients, making the wrapped federated learning perform poorly. To deal with this problem, we propose a novel framework called Slack Federated Adversarial Training (SFAT), assigning the client-wise slack during aggregation to combat the intensified heterogeneity. Theoretically, we analyze the convergence of the proposed method to properly relax the objective when combining federated learning and adversarial training. Experimentally, we verify the rationality and effectiveness of SFAT on various benchmarked and real-world datasets with different adversarial training and federated optimization methods. The code is publicly available at https://github.com/ZFancy/SFAT.
LGJun 6, 2023Code
Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection CapabilityJianing Zhu, Hengzhuang Li, Jiangchao Yao et al.
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications. Previous paradigms either explore better scoring functions or utilize the knowledge of outliers to equip the models with the ability of OOD detection. However, few of them pay attention to the intrinsic OOD detection capability of the given model. In this work, we generally discover the existence of an intermediate stage of a model trained on in-distribution (ID) data having higher OOD detection performance than that of its final stage across different settings, and further identify one critical data-level attribution to be learning with the atypical samples. Based on such insights, we propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data. Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them. Extensive experiments and analysis demonstrate the effectiveness of our method. The code is available at: https://github.com/tmlr-group/Unleashing-Mask.
CRSep 4, 2024Code
"Yes, My LoRD." Guiding Language Model Extraction with Locality Reinforced DistillationZi Liang, Qingqing Ye, Yanyun Wang et al.
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt the extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that utilizes the responses of victim model as the signal to guide the crafting of preference for the local model. Theoretical analyses demonstrate that I) The convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA .
LGJan 7Code
Inference Attacks Against Graph Generative Diffusion ModelsXiuling Wang, Xin Huang, Guibo Luo et al.
Graph generative diffusion models have recently emerged as a powerful paradigm for generating complex graph structures, effectively capturing intricate dependencies and relationships within graph data. However, the privacy risks associated with these models remain largely unexplored. In this paper, we investigate information leakage in such models through three types of black-box inference attacks. First, we design a graph reconstruction attack, which can reconstruct graphs structurally similar to those training graphs from the generated graphs. Second, we propose a property inference attack to infer the properties of the training graphs, such as the average graph density and the distribution of densities, from the generated graphs. Third, we develop two membership inference attacks to determine whether a given graph is present in the training set. Extensive experiments on three different types of graph generative diffusion models and six real-world graphs demonstrate the effectiveness of these attacks, significantly outperforming the baseline approaches. Finally, we propose two defense mechanisms that mitigate these inference attacks and achieve a better trade-off between defense strength and target model utility than existing methods. Our code is available at https://zenodo.org/records/17946102.
LGAug 6, 2022
AUTOSHAPE: An Autoencoder-Shapelet Approach for Time Series ClusteringGuozhong Li, Byron Choi, Jianliang Xu et al.
Time series shapelets are discriminative subsequences that have been recently found effective for time series clustering (TSC). The shapelets are convenient for interpreting the clusters. Thus, the main challenge for TSC is to discover high-quality variable-length shapelets to discriminate different clusters. In this paper, we propose a novel autoencoder-shapelet approach (AUTOSHAPE), which is the first study to take the advantage of both autoencoder and shapelet for determining shapelets in an unsupervised manner. An autoencoder is specially designed to learn high-quality shapelets. More specifically, for guiding the latent representation learning, we employ the latest self-supervised loss to learn the unified embeddings for variable-length shapelet candidates (time series subsequences) of different variables, and propose the diversity loss to select the discriminating embeddings in the unified space. We introduce the reconstruction loss to recover shapelets in the original time series space for clustering. Finally, we adopt Davies Bouldin index (DBI) to inform AUTOSHAPE of the clustering performance during learning. We present extensive experiments on AUTOSHAPE. To evaluate the clustering performance on univariate time series (UTS), we compare AUTOSHAPE with 15 representative methods using UCR archive datasets. To study the performance of multivariate time series (MTS), we evaluate AUTOSHAPE on 30 UEA archive datasets with 5 competitive methods. The results validate that AUTOSHAPE is the best among all the methods compared. We interpret clusters with shapelets, and can obtain interesting intuitions about clusters in two UTS case studies and one MTS case study, respectively.
LGJan 22
Communication-efficient Federated Graph Classification via Generative Diffusion ModelingXiuling Wang, Xin Huang, Haibo Hu et al.
Graph Neural Networks (GNNs) unlock new ways of learning from graph-structured data, proving highly effective in capturing complex relationships and patterns. Federated GNNs (FGNNs) have emerged as a prominent distributed learning paradigm for training GNNs over decentralized data. However, FGNNs face two significant challenges: high communication overhead from multiple rounds of parameter exchanges and non-IID data characteristics across clients. To address these issues, we introduce CeFGC, a novel FGNN paradigm that facilitates efficient GNN training over non-IID data by limiting communication between the server and clients to three rounds only. The core idea of CeFGC is to leverage generative diffusion models to minimize direct client-server communication. Each client trains a generative diffusion model that captures its local graph distribution and shares this model with the server, which then redistributes it back to all clients. Using these generative models, clients generate synthetic graphs combined with their local graphs to train local GNN models. Finally, clients upload their model weights to the server for aggregation into a global GNN model. We theoretically analyze the I/O complexity of communication volume to show that CeFGC reduces to a constant of three communication rounds only. Extensive experiments on several real graph datasets demonstrate the effectiveness and efficiency of CeFGC against state-of-the-art competitors, reflecting our superior performance on non-IID graphs by aligning local and global model objectives and enriching the training set with diverse graphs.
AIAug 18, 2022
Pathway to Future Symbiotic CreativityYike Guo, Qifeng Liu, Jie Chen et al.
This report presents a comprehensive view of our vision on the development path of the human-machine symbiotic art creation. We propose a classification of the creative system with a hierarchy of 5 classes, showing the pathway of creativity evolving from a mimic-human artist (Turing Artists) to a Machine artist in its own right. We begin with an overview of the limitations of the Turing Artists then focus on the top two-level systems, Machine Artists, emphasizing machine-human communication in art creation. In art creation, it is necessary for machines to understand humans' mental states, including desires, appreciation, and emotions, humans also need to understand machines' creative capabilities and limitations. The rapid development of immersive environment and further evolution into the new concept of metaverse enable symbiotic art creation through unprecedented flexibility of bi-directional communication between artists and art manifestation environments. By examining the latest sensor and XR technologies, we illustrate the novel way for art data collection to constitute the base of a new form of human-machine bidirectional communication and understanding in art creation. Based on such communication and understanding mechanisms, we propose a novel framework for building future Machine artists, which comes with the philosophy that a human-compatible AI system should be based on the "human-in-the-loop" principle rather than the traditional "end-to-end" dogma. By proposing a new form of inverse reinforcement learning model, we outline the platform design of machine artists, demonstrate its functions and showcase some examples of technologies we have developed. We also provide a systematic exposition of the ecosystem for AI-based symbiotic art form and community with an economic model built on NFT technology. Ethical issues for the development of machine artists are also discussed.
51.9DBApr 22
HIRE: A Hybrid Learned Index for Robust and Efficient Performance under Mixed WorkloadsXinyi Zhang, Liang Liang, Anastasia Ailamaki et al.
Indexes are critical for efficient data retrieval and updates in modern databases. Recent advances in machine learning have led to the development of learned indexes, which model the cumulative distribution function of data to predict search positions and accelerate query processing. While learned indexes substantially outperform traditional structures for point lookups, they often suffer from high tail latency, suboptimal range query performance, and inconsistent effectiveness across diverse workloads. To address these challenges, this paper proposes HIRE, a hybrid in-memory index structure designed to deliver efficient performance consistently. HIRE combines the structural and performance robustness of traditional indexes with the predictive power of model-based prediction to reduce search overhead while maintaining worst-case stability. Specifically, it employs (1) hybrid leaf nodes adaptive to varying data distributions and workloads, (2) model-accelerated internal nodes augmented by log-based updates for efficient updates, (3) a nonblocking, cost-driven recalibration mechanism for dynamic data, and (4) an inter-level optimized bulk-loading algorithm accounting for leaf and internal-node errors. Experimental results on multiple real-world datasets demonstrate that HIRE outperforms both state-of-the-art learned indexes and traditional structures in range-query throughput, tail latency, and overall stability. Compared to state-of-the-art learned indexes and traditional indexes, HIRE achieves up to 41.7$\times$ higher throughput under mixed workloads, reduces tail latency by up to 98% across varying scenarios.
SIMar 26, 2025Code
Adaptive Local Clustering over Attributed GraphsHaoran Zheng, Renchi Yang, Jianliang Xu
Given a graph $G$ and a seed node $v_s$, the objective of local graph clustering (LGC) is to identify a subgraph $C_s \in G$ (a.k.a. local cluster) surrounding $v_s$ in time roughly linear with the size of $C_s$. This approach yields personalized clusters without needing to access the entire graph, which makes it highly suitable for numerous applications involving large graphs. However, most existing solutions merely rely on the topological connectivity between nodes in $G$, rendering them vulnerable to missing or noisy links that are commonly present in real-world graphs. To address this issue, this paper resorts to leveraging the complementary nature of graph topology and node attributes to enhance local clustering quality. To effectively exploit the attribute information, we first formulate the LGC as an estimation of the bidirectional diffusion distribution (BDD), which is specialized for capturing the multi-hop affinity between nodes in the presence of attributes. Furthermore, we propose LACA, an efficient and effective approach for LGC that achieves superb empirical performance on multiple real datasets while maintaining strong locality. The core components of LACA include (i) a fast and theoretically-grounded preprocessing technique for node attributes, (ii) an adaptive algorithm for diffusing any vectors over $G$ with rigorous theoretical guarantees and expedited convergence, and (iii) an effective three-step scheme for BDD approximation. Extensive experiments, comparing 17 competitors on 8 real datasets, show that LACA outperforms all competitors in terms of result quality measured against ground truth local clusters, while also being up to orders of magnitude faster. The code is available at https://github.com/HaoranZ99/alac.
AINov 6, 2024
LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space ExplorationYukun Cao, Zengyi Gao, Zhiyang Li et al.
GraphRAG integrates (knowledge) graphs with large language models (LLMs) to improve reasoning accuracy and contextual relevance. Despite its promising applications and strong relevance to multiple research communities, such as databases and natural language processing, GraphRAG currently lacks modular workflow analysis, systematic solution frameworks, and insightful empirical studies. To bridge these gaps, we propose LEGO-GraphRAG, a modular framework that enables: 1) fine-grained decomposition of the GraphRAG workflow, 2) systematic classification of existing techniques and implemented GraphRAG instances, and 3) creation of new GraphRAG instances. Our framework facilitates comprehensive empirical studies of GraphRAG on large-scale real-world graphs and diverse query sets, revealing insights into balancing reasoning quality, runtime efficiency, and token or GPU cost, that are essential for building advanced GraphRAG systems.
LGFeb 26, 2025
A Sample-Level Evaluation and Generative Framework for Model Inversion AttacksHaoyang Li, Li Bai, Qingqing Ye et al.
Model Inversion (MI) attacks, which reconstruct the training dataset of neural networks, pose significant privacy concerns in machine learning. Recent MI attacks have managed to reconstruct realistic label-level private data, such as the general appearance of a target person from all training images labeled on him. Beyond label-level privacy, in this paper we show sample-level privacy, the private information of a single target sample, is also important but under-explored in the MI literature due to the limitations of existing evaluation metrics. To address this gap, this study introduces a novel metric tailored for training-sample analysis, namely, the Diversity and Distance Composite Score (DDCS), which evaluates the reconstruction fidelity of each training sample by encompassing various MI attack attributes. This, in turn, enhances the precision of sample-level privacy assessments. Leveraging DDCS as a new evaluative lens, we observe that many training samples remain resilient against even the most advanced MI attack. As such, we further propose a transfer learning framework that augments the generative capabilities of MI attackers through the integration of entropy loss and natural gradient descent. Extensive experiments verify the effectiveness of our framework on improving state-of-the-art MI attacks over various metrics including DDCS, coverage and FID. Finally, we demonstrate that DDCS can also be useful for MI defense, by identifying samples susceptible to MI attacks in an unsupervised manner.
AIJan 23, 2024
ChatGraph: Chat with Your GraphsYun Peng, Sen Lin, Qian Chen et al.
Graph analysis is fundamental in real-world applications. Traditional approaches rely on SPARQL-like languages or clicking-and-dragging interfaces to interact with graph data. However, these methods either require users to possess high programming skills or support only a limited range of graph analysis functionalities. To address the limitations, we propose a large language model (LLM)-based framework called ChatGraph. With ChatGraph, users can interact with graphs through natural language, making it easier to use and more flexible than traditional approaches. The core of ChatGraph lies in generating chains of graph analysis APIs based on the understanding of the texts and graphs inputted in the user prompts. To achieve this, ChatGraph consists of three main modules: an API retrieval module that searches for relevant APIs, a graph-aware LLM module that enables the LLM to comprehend graphs, and an API chain-oriented finetuning module that guides the LLM in generating API chains.
AISep 14, 2025
AI-Generated Content in Cross-Domain Applications: Research Trends, Challenges and PropositionsJianxin Li, Liang Qu, Taotao Cai et al.
Artificial Intelligence Generated Content (AIGC) has rapidly emerged with the capability to generate different forms of content, including text, images, videos, and other modalities, which can achieve a quality similar to content created by humans. As a result, AIGC is now widely applied across various domains such as digital marketing, education, and public health, and has shown promising results by enhancing content creation efficiency and improving information delivery. However, there are few studies that explore the latest progress and emerging challenges of AIGC across different domains. To bridge this gap, this paper brings together 16 scholars from multiple disciplines to provide a cross-domain perspective on the trends and challenges of AIGC. Specifically, the contributions of this paper are threefold: (1) It first provides a broader overview of AIGC, spanning the training techniques of Generative AI, detection methods, and both the spread and use of AI-generated content across digital platforms. (2) It then introduces the societal impacts of AIGC across diverse domains, along with a review of existing methods employed in these contexts. (3) Finally, it discusses the key technical challenges and presents research propositions to guide future work. Through these contributions, this vision paper seeks to offer readers a cross-domain perspective on AIGC, providing insights into its current research trends, ongoing challenges, and future directions.
LGNov 25, 2025
Cross-Contrastive Clustering for Multimodal Attributed Graphs with Dual Graph FilteringHaoran Zheng, Renchi Yang, Hongtao Wang et al.
Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities that associate attributes from multiple data modalities (text, images, etc.). Clustering over such data finds numerous practical applications in real scenarios, including social community detection, medical data analytics, etc. However, as revealed by our empirical studies, existing multi-view clustering solutions largely rely on the high correlation between attributes across various views and overlook the unique characteristics (e.g., low modality-wise correlation and intense feature-wise noise) of multimodal attributes output by large pre-trained language and vision models in MMAGs, leading to suboptimal clustering performance. Inspired by foregoing empirical observations and our theoretical analyses with graph signal processing, we propose the Dual Graph Filtering (DGF) scheme, which innovatively incorporates a feature-wise denoising component into node representation learning, thereby effectively overcoming the limitations of traditional graph filters adopted in the extant multi-view graph clustering approaches. On top of that, DGF includes a tri-cross contrastive training strategy that employs instance-level contrastive learning across modalities, neighborhoods, and communities for learning robust and discriminative node representations. Our comprehensive experiments on eight benchmark MMAG datasets exhibit that DGF is able to outperform a wide range of state-of-the-art baselines consistently and significantly in terms of clustering quality measured against ground-truth labels.
LGNov 25, 2025
Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress MajorizationHaoran Zheng, Renchi Yang, Yubo Zhou et al.
Message passing neural networks (MPNNs) have emerged as go-to models for learning on graph-structured data in the past decade. Despite their effectiveness, most of such models still incur severe issues such as over-smoothing and -correlation, due to their underlying objective of minimizing the Dirichlet energy and the derived neighborhood aggregation operations. In this paper, we propose the DDSM, a new MPNN model built on an optimization framework that includes the stress majorization and orthogonal regularization for overcoming the above issues. Further, we introduce the diffusion distances for nodes into the framework to guide the new message passing operations and develop efficient algorithms for distance approximations, both backed by rigorous theoretical analyses. Our comprehensive experiments showcase that DDSM consistently and considerably outperforms 15 strong baselines on both homophilic and heterophilic graphs.
CRSep 27, 2025
Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic DataZi Liang, Qingqing Ye, Xuan Liu et al.
Synthetic data refers to artificial samples generated by models. While it has been validated to significantly enhance the performance of large language models (LLMs) during training and has been widely adopted in LLM development, potential security risks it may introduce remain uninvestigated. This paper systematically evaluates the resilience of synthetic-data-integrated training paradigm for LLMs against mainstream poisoning and backdoor attacks. We reveal that such a paradigm exhibits strong resistance to existing attacks, primarily thanks to the different distribution patterns between poisoning data and queries used to generate synthetic samples. To enhance the effectiveness of these attacks and further investigate the security risks introduced by synthetic data, we introduce a novel and universal attack framework, namely, Virus Infection Attack (VIA), which enables the propagation of current attacks through synthetic data even under purely clean queries. Inspired by the principles of virus design in cybersecurity, VIA conceals the poisoning payload within a protective "shell" and strategically searches for optimal hijacking points in benign samples to maximize the likelihood of generating malicious content. Extensive experiments on both data poisoning and backdoor attacks show that VIA significantly increases the presence of poisoning content in synthetic data and correspondingly raises the attack success rate (ASR) on downstream models to levels comparable to those observed in the poisoned upstream models.
LGAug 4, 2025
Skeleton-Guided Learning for Shortest Path SearchTiantian Liu, Xiao Li, Huan Li et al.
Shortest path search is a core operation in graph-based applications, yet existing methods face important limitations. Classical algorithms such as Dijkstra's and A* become inefficient as graphs grow more complex, while index-based techniques often require substantial preprocessing and storage. Recent learning-based approaches typically focus on spatial graphs and rely on context-specific features like geographic coordinates, limiting their general applicability. We propose a versatile learning-based framework for shortest path search on generic graphs, without requiring domain-specific features. At the core of our approach is the construction of a skeleton graph that captures multi-level distance and hop information in a compact form. A Skeleton Graph Neural Network (SGNN) operates on this structure to learn node embeddings and predict distances and hop lengths between node pairs. These predictions support LSearch, a guided search algorithm that uses model-driven pruning to reduce the search space while preserving accuracy. To handle larger graphs, we introduce a hierarchical training strategy that partitions the graph into subgraphs with individually trained SGNNs. This structure enables HLSearch, an extension of our method for efficient path search across graph partitions. Experiments on five diverse real-world graphs demonstrate that our framework achieves strong performance across graph types, offering a flexible and effective solution for learning-based shortest path search.
CLApr 22, 2025
Cequel: Cost-Effective Querying of Large Language Models for Text ClusteringHongtao Wang, Taiyan Zhang, Renchi Yang et al.
Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features. In the literature, this task is formulated either as metric clustering over pre-trained text embeddings or as graph clustering based on pairwise similarities derived from an oracle, e.g., a large machine learning model. Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings and accurate semantic similarity estimates. However, leveraging LLMs at scale introduces substantial computational and financial costs due to the large number of required API queries or inference calls. To address this issue, we propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries. At its core, Cequel constructs must-link and cannot-link constraints by selectively querying LLMs on informative text pairs or triplets, identified via our proposed algorithms, EdgeLLM and TriangleLLM. These constraints are then utilized in a weighted constrained clustering algorithm to form high-quality clusters. Specifically, EdgeLLM and TriangleLLM employ carefully designed greedy selection strategies and prompting techniques to identify and extract informative constraints efficiently. Experiments on multiple benchmark datasets demonstrate that Cequel consistently outperforms existing methods in unsupervised text clustering under the same query budget.
CLApr 7, 2025
SAFT: Structure-aware Transformers for Textual Interaction ClassificationHongtao Wang, Renchi Yang, Hewen Wang et al.
Textual interaction networks (TINs) are an omnipresent data structure used to model the interplay between users and items on e-commerce websites, social networks, etc., where each interaction is associated with a text description. Classifying such textual interactions (TIC) finds extensive use in detecting spam reviews in e-commerce, fraudulent transactions in finance, and so on. Existing TIC solutions either (i) fail to capture the rich text semantics due to the use of context-free text embeddings, and/or (ii) disregard the bipartite structure and node heterogeneity of TINs, leading to compromised TIC performance. In this work, we propose SAFT, a new architecture that integrates language- and graph-based modules for the effective fusion of textual and structural semantics in the representation learning of interactions. In particular, line graph attention (LGA)/gated attention units (GAUs) and pretrained language models (PLMs) are capitalized on to model the interaction-level and token-level signals, which are further coupled via the proxy token in an iterative and contextualized fashion. Additionally, an efficient and theoretically-grounded approach is developed to encode the local and global topology information pertaining to interactions into structural embeddings. The resulting embeddings not only inject the structural features underlying TINs into the textual interaction encoding but also facilitate the design of graph sampling strategies. Extensive empirical evaluations on multiple real TIN datasets demonstrate the superiority of SAFT over the state-of-the-art baselines in TIC accuracy.
LGMar 27, 2025
AdvSGM: Differentially Private Graph Learning via Adversarial Skip-gram ModelSen Zhang, Qingqing Ye, Haibo Hu et al.
The skip-gram model (SGM), which employs a neural network to generate node vectors, serves as the basis for numerous popular graph embedding techniques. However, since the training datasets contain sensitive linkage information, the parameters of a released SGM may encode private information and pose significant privacy risks. Differential privacy (DP) is a rigorous standard for protecting individual privacy in data analysis. Nevertheless, when applying differential privacy to skip-gram in graphs, it becomes highly challenging due to the complex link relationships, which potentially result in high sensitivity and necessitate substantial noise injection. To tackle this challenge, we present AdvSGM, a differentially private skip-gram for graphs via adversarial training. Our core idea is to leverage adversarial training to privatize skip-gram while improving its utility. Towards this end, we develop a novel adversarial training module by devising two optimizable noise terms that correspond to the parameters of a skip-gram. By fine-tuning the weights between modules within AdvSGM, we can achieve differentially private gradient updates without additional noise injection. Extensive experimental results on six real-world graph datasets show that AdvSGM preserves high data utility across different downstream tasks.
LGJun 12, 2024
Decoupling the Class Label and the Target Concept in Machine UnlearningJianing Zhu, Bo Han, Jiangchao Yao et al.
Machine unlearning as an emerging research topic for data regulations, aims to adjust a trained model to approximate a retrained one that excludes a portion of training data. Previous studies showed that class-wise unlearning is successful in forgetting the knowledge of a target class, through gradient ascent on the forgetting data or fine-tuning with the remaining data. However, while these methods are useful, they are insufficient as the class label and the target concept are often considered to coincide. In this work, we decouple them by considering the label domain mismatch and investigate three problems beyond the conventional all matched forgetting, e.g., target mismatch, model mismatch, and data mismatch forgetting. We systematically analyze the new challenges in restrictively forgetting the target concept and also reveal crucial forgetting dynamics in the representation level to realize these tasks. Based on that, we propose a general framework, namely, TARget-aware Forgetting (TARF). It enables the additional tasks to actively forget the target concept while maintaining the rest part, by simultaneously conducting annealed gradient ascent on the forgetting data and selected gradient descent on the hard-to-affect remaining data. Empirically, various experiments under the newly introduced settings are conducted to demonstrate the effectiveness of our TARF.
LGJun 9, 2021
Reliable Adversarial Distillation with Unreliable TeachersJianing Zhu, Jiangchao Yao, Bo Han et al.
In ordinary distillation, student networks are trained with soft labels (SLs) given by pretrained teacher networks, and students are expected to improve upon teachers since SLs are stronger supervision than the original hard labels. However, when considering adversarial robustness, teachers may become unreliable and adversarial distillation may not work: teachers are pretrained on their own adversarial data, and it is too demanding to require that teachers are also good at every adversarial data queried by students. Therefore, in this paper, we propose reliable introspective adversarial distillation (IAD) where students partially instead of fully trust their teachers. Specifically, IAD distinguishes between three cases given a query of a natural data (ND) and the corresponding adversarial data (AD): (a) if a teacher is good at AD, its SL is fully trusted; (b) if a teacher is good at ND but not AD, its SL is partially trusted and the student also takes its own SL into account; (c) otherwise, the student only relies on its own SL. Experiments demonstrate the effectiveness of IAD for improving upon teachers in terms of adversarial robustness.
CRMar 26, 2021
Do the Rich Get Richer? Fairness Analysis for Blockchain IncentivesYuming Huang, Jing Tang, Qianhao Cong et al.
Proof-of-Work (PoW) is the most widely adopted incentive model in current blockchain systems, which unfortunately is energy inefficient. Proof-of-Stake (PoS) is then proposed to tackle the energy issue. The rich-get-richer concern of PoS has been heavily debated in the blockchain community. The debate is centered around the argument that whether rich miners possessing more stakes will obtain higher staking rewards and further increase their potential income in the future. In this paper, we define two types of fairness, i.e., expectational fairness and robust fairness, that are useful for answering this question. In particular, expectational fairness illustrates that the expected income of a miner is proportional to her initial investment, indicating that the expected return on investment is a constant. To better capture the uncertainty of mining outcomes, robust fairness is proposed to characterize whether the return on investment concentrates to a constant with high probability as time evolves. Our analysis shows that the classical PoW mechanism can always preserve both types of fairness as long as the mining game runs for a sufficiently long time. Furthermore, we observe that current PoS blockchains implement various incentive models and discuss three representatives, namely ML-PoS, SL-PoS and C-PoS. We find that (i) ML-PoS (e.g., Qtum and Blackcoin) preserves expectational fairness but may not achieve robust fairness, (ii) SL-PoS (e.g., NXT) does not protect any type of fairness, and (iii) C-PoS (e.g., Ethereum 2.0) outperforms ML-PoS in terms of robust fairness while still maintaining expectational fairness. Finally, massive experiments on real blockchain systems and extensive numerical simulations are performed to validate our analysis.
SIJan 24, 2021
BU-Trace: A Permissionless Mobile System for Privacy-Preserving Intelligent Contact TracingZhe Peng, Jinbin Huang, Haixin Wang et al.
The coronavirus disease 2019 (COVID-19) pandemic has caused an unprecedented health crisis for the global. Digital contact tracing, as a transmission intervention measure, has shown its effectiveness on pandemic control. Despite intensive research on digital contact tracing, existing solutions can hardly meet users' requirements on privacy and convenience. In this paper, we propose BU-Trace, a novel permissionless mobile system for privacy-preserving intelligent contact tracing based on QR code and NFC technologies. First, a user study is conducted to investigate and quantify the user acceptance of a mobile contact tracing system. Second, a decentralized system is proposed to enable contact tracing while protecting user privacy. Third, an intelligent behavior detection algorithm is designed to ease the use of our system. We implement BU-Trace and conduct extensive experiments in several real-world scenarios. The experimental results show that BU-Trace achieves a privacy-preserving and intelligent mobile system for contact tracing without requesting location or other privacy-related permissions.
CRJan 16, 2021
AGChain: A Blockchain-based Gateway for Trustworthy App Delegation from Mobile App MarketsMengjie Chen, Xiao Yi, Daoyuan Wu et al.
The popularity of smartphones has led to the growth of mobile app markets, creating a need for enhanced transparency, global access, and secure downloading. This paper introduces AGChain, a blockchain-based gateway that enables trustworthy app delegation within existing markets. AGChain ensures that markets can continue providing services while users benefit from permanent, distributed, and secure app delegation. During its development, we address two key challenges: significantly reducing smart contract gas costs and enabling fully distributed IPFS-based file storage. Additionally, we tackle three system issues related to security and sustainability. We have implemented a prototype of AGChain on Ethereum and Polygon blockchains, achieving effective security and decentralization with a minimal gas cost of around 0.002 USD per app upload (no cost for app download). The system also exhibits reasonable performance with an average overhead of 12%.
LGAug 26, 2020
Graph Learning for Combinatorial Optimization: A Survey of State-of-the-ArtYun Peng, Byron Choi, Jianliang Xu
Graphs have been widely used to represent complex data in many applications. Efficient and effective analysis of graphs is important for graph-based applications. However, most graph analysis tasks are combinatorial optimization (CO) problems, which are NP-hard. Recent studies have focused a lot on the potential of using machine learning (ML) to solve graph-based CO problems. Most recent methods follow the two-stage framework. The first stage is graph representation learning, which embeds the graphs into low-dimension vectors. The second stage uses ML to solve the CO problems using the embeddings of the graphs learned in the first stage. The works for the first stage can be classified into two categories, graph embedding (GE) methods and end-to-end (E2E) learning methods. For GE methods, learning graph embedding has its own objective, which may not rely on the CO problems to be solved. The CO problems are solved by independent downstream tasks. For E2E learning methods, the learning of graph embeddings does not have its own objective and is an intermediate step of the learning procedure of solving the CO problems. The works for the second stage can also be classified into two categories, non-autoregressive methods and autoregressive methods. Non-autoregressive methods predict a solution for a CO problem in one shot. A non-autoregressive method predicts a matrix that denotes the probability of each node/edge being a part of a solution of the CO problem. The solution can be computed from the matrix. Autoregressive methods iteratively extend a partial solution step by step. At each step, an autoregressive method predicts a node/edge conditioned to current partial solution, which is used to its extension. In this survey, we provide a thorough overview of recent studies of the graph learning-based CO methods. The survey ends with several remarks on future research directions.
CRNov 11, 2019
Cost-Effective Data Feeds to Blockchains via Workload-Adaptive Data ReplicationKai Li, Yuzhe Tang, Jiaqi Chen et al.
Feeding external data to a blockchain, a.k.a. data feed, is an essential task to enable blockchain interoperability and support emerging cross-domain applications, notably stablecoins. Given the data-intensive feeds in real life (e.g., high-frequency price updates) and the high cost in using blockchain, namely Gas, it is imperative to reduce the Gas cost of data feeds. Motivated by the constant-changing workloads in finance and other applications, this work focuses on designing a dynamic, workload-aware approach for cost effectiveness in Gas. This design space is understudied in the existing blockchain research which has so far focused on static data placement. This work presents GRuB, a cost-effective data feed that dynamically replicates data between the blockchain and an off-chain cloud storage. GRuB's data replication is workload-adaptive by monitoring the current workload and making online decisions w.r.t. data replication. A series of online algorithms are proposed that achieve the bounded worst-case cost in blockchain's Gas. GRuB runs the decision-making components on the untrusted cloud off-chain for lower Gas costs, and employs a security protocol to authenticate the data transferred between the blockchain and cloud. The overall GRuB system can autonomously achieve low Gas costs with changing workloads. We built a GRuB prototype functional with Ethereum and Google LevelDB, and supported real applications in stablecoins. Under real workloads collected from the Ethereum contract-call history and mixed workloads of YCSB, we systematically evaluate GRuB's cost which shows a saving of Gas by 10% ~ 74%, with comparison to the baselines of static data-placement.
CRApr 26, 2019
Authenticated Key-Value Stores with Hardware EnclavesYuzhe Tang, Ju Chen, Kai Li et al.
Authenticated data storage on an untrusted platform is an important computing paradigm for cloud applications ranging from big-data outsourcing, to cryptocurrency and certificate transparency log. These modern applications increasingly feature update-intensive workloads, whereas existing authenticated data structures (ADSs) designed with in-place updates are inefficient to handle such workloads. In this paper, we address this issue and propose a novel authenticated log-structured merge tree (eLSM) based key-value store by leveraging Intel SGX enclaves. We present a system design that runs the code of eLSM store inside enclave. To circumvent the limited enclave memory (128 MB with the latest Intel CPUs), we propose to place the memory buffer of the eLSM store outside the enclave and protect the buffer using a new authenticated data structure by digesting individual LSM-tree levels. We design protocols to support query authentication in data integrity, completeness (under range queries), and freshness. The proof in our protocol is made small by including only the Merkle proofs at selective levels. We implement eLSM on top of Google LevelDB and Facebook RocksDB with minimal code change and performance interference. We evaluate the performance of eLSM under the YCSB workload benchmark and show a performance advantage of up to 4.5X speedup.
CRApr 14, 2019
Secure Consistency Verification for Untrusted Cloud Storage by Public BlockchainsKai Li, Yuzhe Tang, Beom Heyn Kim et al.
This work presents ContractChecker, a Blockchain-based security protocol for verifying the storage consistency between the mutually distrusting cloud provider and clients. Unlike existing protocols, the ContractChecker uniquely delegates log auditing to the Blockchain, and has the advantages in reducing client cost and lowering requirements on client availability, lending itself to modern scenarios with mobile and web clients. The ContractChecker collects the logs from both clients and the cloud server, and verifies the consistency by cross-checking the logs. By this means, it does not only detects the attacks from malicious clients and server forging their logs, but also is able to mitigate those attacks and recover the system from them. In addition, we design new attacks against ContractChecker exploiting various limits in real Blockchain systems (e.g., write unavailability, Blockchain forks, contract race conditions). We analyze and harden the security of ContractChecker protocols against the proposed new attacks. For evaluating the cost, we build a functional prototype of the ContractChecker on Ethereum/Solidity. By experiments on private and public Ethereum testnets, we extensively evaluate the cost of the ContractChecker in comparison with that of existing client-based log auditing works. The result shows the ContractChecker can scale to hundreds of clients and save client costs by more than one order of magnitude.
DBDec 6, 2018
vChain: Enabling Verifiable Boolean Range Queries over Blockchain DatabasesCheng Xu, Ce Zhang, Jianliang Xu
Blockchains have recently been under the spotlight due to the boom of cryptocurrencies and decentralized applications. There is an increasing demand for querying the data stored in a blockchain database. To ensure query integrity, the user can maintain the entire blockchain database and query the data locally. However, this approach is not economic, if not infeasible, because of the blockchain's huge data size and considerable maintenance costs. In this paper, we take the first step toward investigating the problem of verifiable query processing over blockchain databases. We propose a novel framework, called vChain, that alleviates the storage and computing costs of the user and employs verifiable queries to guarantee the results' integrity. To support verifiable Boolean range queries, we propose an accumulator-based authenticated data structure that enables dynamic aggregation over arbitrary query attributes. Two new indexes are further developed to aggregate intra-block and inter-block data records for efficient query verification. We also propose an inverted prefix tree structure to accelerate the processing of a large number of subscription queries simultaneously. Security analysis and empirical study validate the robustness and practicality of the proposed techniques.