Dongfang Zhao

CR
h-index46
36papers
120citations
Novelty53%
AI Score56

36 Papers

36.7IRJun 2Code
Slipstream: Locality-Aware Graph Index Construction for Streaming Approximate Nearest Neighbor Search

Shubing Yang, Dongfang Zhao

Graph indexes are widely used for high-recall approximate nearest neighbor search (ANNS), but many real-time applications require streaming ANNS. In these real-time applications, continuously arriving embeddings must search the existing graph for candidate neighbors before updating graph edges, which makes repeated index construction a bottleneck for streaming ingestion workloads. We propose Slipstream, a new method that significantly reduces the computational cost of frequent insertions in graph indexes for ANNS. The core idea of Slipstream is exploiting the continuity in vector streams: the newly arrived point starts from promising candidates found during the previous insertion rather than searching from the entry point. More technically, Slipstream evaluates distinct subsets of starting candidates followed by an adaptive controller that narrows or widens the range according to the stream's stability. We further show that Slipstream is beyond heuristic: We derive an abstract model to characterize Slipstream's performance and analyze its theoretical bounds. We have implemented Slipstream in two popular open-source libraries (Faiss, HNSWLib) and compared it with four baseline methods on five streaming vector datasets. Experimental results show that Slipstream achieves up to 30.8$\times$ higher end-to-end throughput than baselines while maintaining at least 0.95 recall@10.

IVOct 14, 2023Code
UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN

Chunyu Yuan, Dongfang Zhao, Sos S. Agaian

Skin cancer poses a significant public health challenge, necessitating efficient diagnostic tools. We introduce UCM-Net, a novel skin lesion segmentation model combining Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). This lightweight, efficient architecture, deviating from traditional UNet designs, dramatically reduces computational demands, making it ideal for mobile health applications. Evaluated on PH2, ISIC 2017, and ISIC 2018 datasets, UCM-Net demonstrates robust performance with fewer than 50KB parameters and requires less than 0.05 Giga Operations Per Second (GLOPs). Moreover, its minimal memory requirement is just 1.19MB in CPU environment positions. It is a potential benchmark for efficiency in skin lesion segmentation, suitable for deployment in resource-constrained settings. In order to facilitate accessibility and further research in the field, the UCM-Net source code is https://github.com/chunyuyuan/UCM-Net.

85.8CLMay 20Code
DIVE: Embedding Compression via Self-Limiting Gradient Updates

Dongfang Zhao

High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textsc{DIVE} (\textbf{D}imensionality reduction with \textbf{I}mplicit \textbf{V}iew \textbf{E}nsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textsc{DIVE} outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.

IRSep 25, 2024
VectorSearch: Enhancing Document Retrieval with Semantic Embeddings and Optimized Search

Solmaz Seyed Monir, Irene Lau, Shubing Yang et al. · uw

Traditional retrieval methods have been essential for assessing document similarity but struggle with capturing semantic nuances. Despite advancements in latent semantic analysis (LSA) and deep learning, achieving comprehensive semantic understanding and accurate retrieval remains challenging due to high dimensionality and semantic gaps. The above challenges call for new techniques to effectively reduce the dimensions and close the semantic gaps. To this end, we propose VectorSearch, which leverages advanced algorithms, embeddings, and indexing techniques for refined retrieval. By utilizing innovative multi-vector search operations and encoding searches with advanced language models, our approach significantly improves retrieval accuracy. Experiments on real-world datasets show that VectorSearch outperforms baseline metrics, demonstrating its efficacy for large-scale retrieval tasks.

39.6CVMay 7
EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation

Dongfang Zhao

Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence $96.5\%$ of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.

DBMar 15, 2023
Comparative Evaluation of Data Decoupling Techniques for Federated Machine Learning with Database as a Service

Muhammad Jahanzeb Khan, Rui Hu, Mohammad Sadoghi et al.

Federated Learning (FL) is a machine learning approach that allows multiple clients to collaboratively learn a shared model without sharing raw data. However, current FL systems provide an all-in-one solution, which can hinder the wide adoption of FL in certain domains such as scientific applications. To overcome this limitation, this paper proposes a decoupling approach that enables clients to customize FL applications with specific data subsystems. To evaluate this approach, the authors develop a framework called Data-Decoupling Federated Learning (DDFL) and compare it with state-of-the-art FL systems that tightly couple data management and computation. Extensive experiments on various datasets and data management subsystems show that DDFL achieves comparable or better performance in terms of training time, inference accuracy, and database query time. Moreover, DDFL provides clients with more options to tune their FL applications regarding data-related metrics. The authors also provide a detailed qualitative analysis of DDFL when integrated with mainstream database systems.

LGAug 15, 2024
Order-Preserving Dimension Reduction for Multimodal Semantic Embedding

Chengyu Gong, Gefei Shen, Luanzheng Guo et al.

Searching for the $k$-nearest neighbors (KNN) in multimodal data retrieval is computationally expensive, particularly due to the inherent difficulty in comparing similarity measures across different modalities. Recent advances in multimodal machine learning address this issue by mapping data into a shared embedding space; however, the high dimensionality of these embeddings (hundreds to thousands of dimensions) presents a challenge for time-sensitive vision applications. This work proposes Order-Preserving Dimension Reduction (OPDR), aiming to reduce the dimensionality of embeddings while preserving the ranking of KNN in the lower-dimensional space. One notable component of OPDR is a new measure function to quantify KNN quality as a global metric, based on which we derive a closed-form map between target dimensionality and key contextual parameters. We have integrated OPDR with multiple state-of-the-art dimension-reduction techniques, distance functions, and embedding models; experiments on a variety of multimodal datasets demonstrate that OPDR effectively retains recall high accuracy while significantly reducing computational costs.

64.1SEMar 28
A Large-Scale Comprehensive Measurement of AI-Generated Code in Real-World Repositories A Large-Scale Comprehensive Measurement of AI-Generated Code in Real-World Repositories

Tianhao Mao, Dongfang Zhao, Haixu Tang et al.

Large language models (LLMs) are rapidly transforming software engineering by enabling developers to generate code ranging from small snippets to entire projects. As AI-generated code becomes increasingly integrated into real-world systems, understanding its characteristics and impact is critical. However, prior work primarily focuses on small-scale, controlled evaluations and lacks comprehensive analysis in real-world settings. In this paper, we present a large-scale empirical study of AI-generated code in real-world repositories. We analyze both code-level metrics (\eg complexity, structure, and defect-related indicators) and commit-level characteristics (\eg commit size, frequency, and post-commit stability). To enable this study, we develop heuristic filter with LLM classification to identify AI-generated code and construct a large dataset. Our results provide new insights into how AI-generated code differs from human-written code and how AI assistance influences development practices. These findings contribute to a deeper understanding of the practical implications of AI-assisted programming.

23.8DBMar 26Code
SIVF: GPU-Resident IVF Index for Streaming Vector Search

Dongfang Zhao

GPU-accelerated Inverted File (IVF) index is one of the industry standards for large-scale vector search but relies on static VRAM layouts that hinder real-time mutability. Our benchmark and analysis reveal that existing designs of GPU IVF necessitate expensive CPU-GPU data transfers for index updates, causing system latency to spike from milliseconds to seconds in streaming scenarios. We present SIVF, a GPU-native index that enables high-velocity, in-place mutation via a series of new data structures and algorithms, such as conflict-free slab allocation and coalesced search on non-contiguous memory. SIVF has been implemented and integrated into the open-source vector search library, Faiss. Evaluation against baselines with diverse vector datasets demonstrates that SIVF reduces deletion latency by orders of magnitude compared to the state-of-the-arts. Furthermore, distributed experiments on a 12-GPU cluster demonstrate that SIVF exhibits near perfect linear scalability, achieving an aggregate ingestion throughput of 4.07 million vectors/s and a deletion throughput of 108.5 million vectors/s.

LGSep 28, 2024
VecLSTM: Trajectory Data Processing and Management for Activity Recognition through LSTM Vectorization and Database Integration

Solmaz Seyed Monir, Dongfang Zhao · uw

Activity recognition is a challenging task due to the large scale of trajectory data and the need for prompt and efficient processing. Existing methods have attempted to mitigate this problem by employing traditional LSTM architectures, but these approaches often suffer from inefficiencies in processing large datasets. In response to this challenge, we propose VecLSTM, a novel framework that enhances the performance and efficiency of LSTM-based neural networks. Unlike conventional approaches, VecLSTM incorporates vectorization layers, leveraging optimized mathematical operations to process input sequences more efficiently. We have implemented VecLSTM and incorporated it into the MySQL database. To evaluate the effectiveness of VecLSTM, we compare its performance against a conventional LSTM model using a dataset comprising 1,467,652 samples with seven unique labels. Experimental results demonstrate superior accuracy and efficiency compared to the state-of-the-art, with VecLSTM achieving a validation accuracy of 85.57\%, a test accuracy of 85.47\%, and a weighted F1-score of 0.86. Furthermore, VecLSTM significantly reduces training time, offering a 26.2\% reduction compared to traditional LSTM models.

IVMay 24, 2024Code
MUCM-Net: A Mamba Powered UCM-Net for Skin Lesion Segmentation

Chunyu Yuan, Dongfang Zhao, Sos S. Agaian

Skin lesion segmentation is key for early skin cancer detection. Challenges in automatic segmentation from dermoscopic images include variations in color, texture, and artifacts of indistinct lesion boundaries. Deep learning methods like CNNs and U-Net have shown promise in addressing these issues. To further aid early diagnosis, especially on mobile devices with limited computing power, we present MUCM-Net. This efficient model combines Mamba State-Space Models with our UCM-Net architecture for improved feature learning and segmentation. MUCM-Net's Mamba-UCM Layer is optimized for mobile deployment, offering high accuracy with low computational needs. Tested on ISIC datasets, it outperforms other methods in accuracy and computational efficiency, making it a scalable tool for early detection in settings with limited resources. Our MUCM-Net source code is available for research and collaboration, supporting advances in mobile health diagnostics and the fight against skin cancer. In order to facilitate accessibility and further research in the field, the MUCM-Net source code is https://github.com/chunyuyuan/MUCM-Net

CLMay 18, 2024Code
EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models

Yu Huang, Liang Guo, Wanqian Guo et al.

In the field of environmental science, it is crucial to have robust evaluation metrics for large language models to ensure their efficacy and accuracy. We propose EnviroExam, a comprehensive evaluation method designed to assess the knowledge of large language models in the field of environmental science. EnviroExam is based on the curricula of top international universities, covering undergraduate, master's, and doctoral courses, and includes 936 questions across 42 core courses. By conducting 0-shot and 5-shot tests on 31 open-source large language models, EnviroExam reveals the performance differences among these models in the domain of environmental science and provides detailed evaluation standards. The results show that 61.3% of the models passed the 5-shot tests, while 48.39% passed the 0-shot tests. By introducing the coefficient of variation as an indicator, we evaluate the performance of mainstream open-source large language models in environmental science from multiple perspectives, providing effective criteria for selecting and fine-tuning language models in this field. Future research will involve constructing more domain-specific test sets using specialized environmental science textbooks to further enhance the accuracy and specificity of the evaluation.

5.6CRMay 8
Reliable Non-Leveled Homomorphic Encryption for Web Services

Baigang Chen, Dongfang Zhao

With the ubiquitous deployment of web services, ensuring data confidentiality has become a challenging imperative. Fully Homomorphic Encryption (FHE) presents a powerful solution for processing encrypted data; however, its widespread adoption is severely constrained by two fundamental bottlenecks: substantial computational overhead and the absence of a built-in automatic error correction mechanism. These limitations render the deployment of FHE in real-world, complex network environments impractical. To address this dual challenge, this work puts forward a new FHE framework that enhances computational efficiency and integrates an automatic error correction capability through new encoding techniques and an algebraic reliability layer.Our prototype is evaluated through encrypted low-degree activation timing, one experimental public Refresh skeleton invocation, and transport-fault simulations for the Ring--BCH layer. Our current prototype quantifies the cost of encrypted low-degree activation evaluation, the additional latency of an experimental public Refresh skeleton, and the robustness gained from the Ring--BCH transport layer. The Refresh prototype should be interpreted as a skeleton rather than a complete CKKS bootstrapping implementation, since it uses a low-degree surrogate rather than a validated EvalMod circuit. In transport-fault simulations, the BCH interleaver reduces failure rates to below $0.5\%$ under bursty faults and keeps the modeled accuracy within $0.5$ percentage points of the plaintext baseline.

CRMar 1
BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models

Jiayao Wang, Yiping Zhang, Mohammad Maruf Hasan et al.

Self-supervised diffusion models learn high-quality visual representations via latent space denoising. However, their representation layer poses a distinct threat: unlike traditional attacks targeting generative outputs, its unconstrained latent semantic space allows for stealthy backdoors, permitting malicious control upon triggering. In this paper, we propose BadRSSD, the first backdoor attack targeting the representation layer of self-supervised diffusion models. Specifically, it hijacks the semantic representations of poisoned samples with triggers in Principal Component Analysis (PCA) space toward those of a target image, then controls the denoising trajectory during diffusion by applying coordinated constraints across latent, pixel, and feature distribution spaces to steer the model toward generating the specified target. Additionally, we integrate representation dispersion regularization into the constraint framework to maintain feature space uniformity, significantly enhancing attack stealth. This approach preserves normal model functionality (high utility) while achieving precise target generation upon trigger activation (high specificity). Experiments on multiple benchmark datasets demonstrate that BadRSSD substantially outperforms existing attacks in both FID and MSE metrics, reliably establishing backdoors across different architectures and configurations, and effectively resisting state-of-the-art backdoor defenses.

11.1CRMar 16
Hermes: Bridging Relational and Algebraic Abstractions in Homomorphically Encrypted Databases

Dongfang Zhao

Fully Homomorphic Encryption (FHE) promises the ability to compute over encrypted data without revealing sensitive contents. Yet, integrating it into real-world relational databases remains elusive due to prohibitive performance overhead and the structural mismatch between mutable database records and static ciphertexts. This paper presents Hermes, a system that enables homomorphically encrypted vectorized relational queries directly inside a standard SQL engine. To bridge the relational and algebraic abstractions, Hermes introduces a SIMD-aware data model that packs multiple records per ciphertext. By embedding precomputed aggregate statistics alongside data slots, the system supports efficient rotation-free aggregations. Furthermore, to overcome ciphertext immutability, we develop data-oblivious homomorphic algorithms based on slot masking and shifting, enabling secure in-place record modifications. Hermes is implemented as native loadable functions in MySQL, marking the first practical integration of FHE into an industrial-grade relational database engine. Extensive evaluations across diverse datasets demonstrate an over 3400x increase in encryption throughput, an over 4000x speedup for tuple insertions, and a 300x acceleration for deletions when compared to conventional scalar FHE implementations.

AGJul 26, 2025
TokenBlowUp: Resolving Representational Singularities in LLM Token Spaces via Monoidal Transformations

Dongfang Zhao

Recent work has provided compelling evidence challenging the foundational manifold hypothesis for the token embedding spaces of Large Language Models (LLMs). These findings reveal the presence of geometric singularities around polysemous tokens, which can lead to representational instability. Existing methodologies, which presuppose a smooth data manifold, are ill-equipped to address such intrinsic structural flaws. In this paper, we formalize this problem in the language of scheme theory and propose a rigorous resolution by applying the scheme-theoretic blow-up at each singular point. This procedure replaces a singular point in the ambient affine scheme with its exceptional divisor, which we identify as a canonical geometric space -- a projective space of directions -- that houses the disambiguated semantic meanings of the token. This process of ``representational desingularization'' constructs a new geometric landscape for embeddings. We prove a formal theorem guaranteeing the geometric regularization of this new space, showing that the original pathologies are resolved. Finally, we outline the architectural implications of our framework, arguing for a paradigm shift from static look-ups to dynamic, geometrically-grounded computation.

IROct 23, 2024
NexusIndex: Integrating Advanced Vector Indexing and Multi-Model Embeddings for Robust Fake News Detection

Solmaz Seyed Monir, Dongfang Zhao · uw

The proliferation of fake news on digital platforms has underscored the need for robust and scalable detection mechanisms. Traditional methods often fall short in handling large and diverse datasets due to limitations in scalability and accuracy. In this paper, we propose NexusIndex, a novel framework and model that enhances fake news detection by integrating advanced language models, an innovative FAISSNexusIndex layer, and attention mechanisms. Our approach leverages multi-model embeddings to capture rich contextual and semantic nuances, significantly improving text interpretation and classification accuracy. By transforming articles into high-dimensional embeddings and indexing them efficiently, NexusIndex facilitates rapid similarity searches across extensive collections of news articles. The FAISSNexusIndex layer further optimizes this process, enabling real-time detection and enhancing the system's scalability and performance. Our experimental results demonstrate that NexusIndex outperforms state-of-the-art methods in efficiency and accuracy across diverse datasets.

LGNov 30, 2024
Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment

Dongfang Zhao

Multimodal tasks, such as image-text retrieval and generation, require embedding data from diverse modalities into a shared representation space. Aligning embeddings from heterogeneous sources while preserving shared and modality-specific information is a fundamental challenge. This paper provides an initial attempt to integrate algebraic geometry into multimodal representation learning, offering a foundational perspective for further exploration. We model image and text data as polynomials over discrete rings, \( \mathbb{Z}_{256}[x] \) and \( \mathbb{Z}_{|V|}[x] \), respectively, enabling the use of algebraic tools like fiber products to analyze alignment properties. To accommodate real-world variability, we extend the classical fiber product to an approximate fiber product with a tolerance parameter \( ε\), balancing precision and noise tolerance. We study its dependence on \( ε\), revealing asymptotic behavior, robustness to perturbations, and sensitivity to embedding dimensionality. Additionally, we propose a decomposition of the shared embedding space into orthogonal subspaces, \( Z = Z_s \oplus Z_I \oplus Z_T \), where \( Z_s \) captures shared semantics, and \( Z_I \), \( Z_T \) encode modality-specific features. This decomposition is geometrically interpreted via manifolds and fiber bundles, offering insights into embedding structure and optimization. This framework establishes a principled foundation for analyzing multimodal alignment, uncovering connections between robustness, dimensionality allocation, and algebraic structure. It lays the groundwork for further research on embedding spaces in multimodal learning using algebraic geometry.

CRMar 3, 2024
Enhancing Data Provenance and Model Transparency in Federated Learning Systems -- A Database Approach

Michael Gu, Ramasoumya Naraparaju, Dongfang Zhao

Federated Learning (FL) presents a promising paradigm for training machine learning models across decentralized edge devices while preserving data privacy. Ensuring the integrity and traceability of data across these distributed environments, however, remains a critical challenge. The ability to create transparent artificial intelligence, such as detailing the training process of a machine learning model, has become an increasingly prominent concern due to the large number of sensitive (hyper)parameters it utilizes; thus, it is imperative to strike a reasonable balance between openness and the need to protect sensitive information. In this paper, we propose one of the first approaches to enhance data provenance and model transparency in federated learning systems. Our methodology leverages a combination of cryptographic techniques and efficient model management to track the transformation of data throughout the FL process, and seeks to increase the reproducibility and trustworthiness of a trained FL model. We demonstrate the effectiveness of our approach through experimental evaluations on diverse FL scenarios, showcasing its ability to tackle accountability and explainability across the board. Our findings show that our system can greatly enhance data transparency in various FL environments by storing chained cryptographic hashes and client model snapshots in our proposed design for data decoupled FL. This is made possible by also employing multiple optimization techniques which enables comprehensive data provenance without imposing substantial computational loads. Extensive experimental results suggest that integrating a database subsystem into federated learning systems can improve data provenance in an efficient manner, encouraging secure FL adoption in privacy-sensitive applications and paving the way for future advancements in FL transparency and security features.

IRJan 5
MCGI: Manifold-Consistent Graph Indexing for Billion-Scale Disk-Resident Vector Search

Dongfang Zhao

Graph-based Approximate Nearest Neighbor (ANN) search often suffers from performance degradation in high-dimensional spaces due to the ``Euclidean-Geodesic mismatch,'' where greedy routing diverges from the underlying data manifold. To address this, we propose Manifold-Consistent Graph Indexing (MCGI), a geometry-aware and disk-resident indexing method that leverages Local Intrinsic Dimensionality (LID) to dynamically adapt search strategies to the data's intrinsic geometry. Unlike standard algorithms that treat dimensions uniformly, MCGI modulates its beam search budget based on in situ geometric analysis, eliminating dependency on static hyperparameters. Theoretical analysis confirms that MCGI enables improved approximation guarantees by preserving manifold-consistent topological connectivity. Empirically, MCGI achieves 5.8$\times$ higher throughput at 95\% recall on high-dimensional GIST1M compared to state-of-the-art DiskANN. On the billion-scale SIFT1B dataset, MCGI further validates its scalability by reducing high-recall query latency by 3$\times$, while maintaining performance parity on standard lower-dimensional datasets.

IRNov 22, 2025
ProHD: Projection-Based Hausdorff Distance Approximation

Jiuzhou Fu, Luanzheng Guo, Nathan R. Tallent et al.

The Hausdorff distance (HD) is a robust measure of set dissimilarity, but computing it exactly on large, high-dimensional datasets is prohibitively expensive. We propose \textbf{ProHD}, a projection-guided approximation algorithm that dramatically accelerates HD computation while maintaining high accuracy. ProHD identifies a small subset of candidate "extreme" points by projecting the data onto a few informative directions (such as the centroid axis and top principal components) and computing the HD on this subset. This approach guarantees an underestimate of the true HD with a bounded additive error and typically achieves results within a few percent of the exact value. In extensive experiments on image, physics, and synthetic datasets (up to two million points in $D=256$), ProHD runs 10--100$\times$ faster than exact algorithms while attaining 5--20$\times$ lower error than random sampling-based approximations. Our method enables practical HD calculations in scenarios like large vector databases and streaming data, where quick and reliable set distance estimation is needed.

IRSep 30, 2025
RAE: A Neural Network Dimensionality Reduction Method for Nearest Neighbors Preservation in Vector Search

Han Zhang, Dongfang Zhao

While high-dimensional embedding vectors are being increasingly employed in various tasks like Retrieval-Augmented Generation and Recommendation Systems, popular dimensionality reduction (DR) methods such as PCA and UMAP have rarely been adopted for accelerating the retrieval process due to their inability of preserving the nearest neighbor (NN) relationship among vectors. Empowered by neural networks' optimization capability and the bounding effect of Rayleigh quotient, we propose a Regularized Auto-Encoder (RAE) for k-NN preserving dimensionality reduction. RAE constrains the network parameter variation through regularization terms, adjusting singular values to control embedding magnitude changes during reduction, thus preserving k-NN relationships. We provide a rigorous mathematical analysis demonstrating that regularization establishes an upper bound on the norm distortion rate of transformed vectors, thereby offering provable guarantees for k-NN preservation. With modest training overhead, RAE achieves superior k-NN recall compared to existing DR approaches while maintaining fast retrieval efficiency.

IRSep 1, 2025
PIR-RAG: A System for Private Information Retrieval in Retrieval-Augmented Generation

Baiqiang Wang, Qian Lou, Mengxin Zheng et al.

Retrieval-Augmented Generation (RAG) has become a foundational component of modern AI systems, yet it introduces significant privacy risks by exposing user queries to service providers. To address this, we introduce PIR-RAG, a practical system for privacy-preserving RAG. PIR-RAG employs a novel architecture that uses coarse-grained semantic clustering to prune the search space, combined with a fast, lattice-based Private Information Retrieval (PIR) protocol. This design allows for the efficient retrieval of entire document clusters, uniquely optimizing for the end-to-end RAG workflow where full document content is required. Our comprehensive evaluation against strong baseline architectures, including graph-based PIR and Tiptoe-style private scoring, demonstrates PIR-RAG's scalability and its superior performance in terms of "RAG-Ready Latency"-the true end-to-end time required to securely fetch content for an LLM. Our work establishes PIR-RAG as a viable and highly efficient solution for privacy in large-scale AI systems.

CRAug 11, 2025
IPBA: Imperceptible Perturbation Backdoor Attack in Federated Self-Supervised Learning

Jiayao Wang, Yang Song, Zhendong Zhao et al.

Federated self-supervised learning (FSSL) combines the advantages of decentralized modeling and unlabeled representation learning, serving as a cutting-edge paradigm with strong potential for scalability and privacy preservation. Although FSSL has garnered increasing attention, research indicates that it remains vulnerable to backdoor attacks. Existing methods generally rely on visually obvious triggers, which makes it difficult to meet the requirements for stealth and practicality in real-world deployment. In this paper, we propose an imperceptible and effective backdoor attack method against FSSL, called IPBA. Our empirical study reveals that existing imperceptible triggers face a series of challenges in FSSL, particularly limited transferability, feature entanglement with augmented samples, and out-of-distribution properties. These issues collectively undermine the effectiveness and stealthiness of traditional backdoor attacks in FSSL. To overcome these challenges, IPBA decouples the feature distributions of backdoor and augmented samples, and introduces Sliced-Wasserstein distance to mitigate the out-of-distribution properties of backdoor samples, thereby optimizing the trigger generation process. Our experimental results on several FSSL scenarios and datasets show that IPBA significantly outperforms existing backdoor attack methods in performance and exhibits strong robustness under various defense mechanisms.

DBAug 9, 2025
Balancing Privacy and Efficiency: Music Information Retrieval via Additive Homomorphic Encryption

William Zerong Wang, Dongfang Zhao

In the era of generative AI, ensuring the privacy of music data presents unique challenges: unlike static artworks such as images, music data is inherently temporal and multimodal, and it is sampled, transformed, and remixed at an unprecedented scale. These characteristics make its core vector embeddings, i.e, the numerical representations of the music, highly susceptible to being learned, misused, or even stolen by models without accessing the original audio files. Traditional methods like copyright licensing and digital watermarking offer limited protection for these abstract mathematical representations, thus necessitating a stronger, e.g., cryptographic, approach to safeguarding the embeddings themselves. Standard encryption schemes, such as AES, render data unintelligible for computation, making such searches impossible. While Fully Homomorphic Encryption (FHE) provides a plausible solution by allowing arbitrary computations on ciphertexts, its substantial performance overhead remains impractical for large-scale vector similarity searches. Given this trade-off, we propose a more practical approach using Additive Homomorphic Encryption (AHE) for vector similarity search. The primary contributions of this paper are threefold: we analyze threat models unique to music information retrieval systems; we provide a theoretical analysis and propose an efficient AHE-based solution through inner products of music embeddings to deliver privacy-preserving similarity search; and finally, we demonstrate the efficiency and practicality of the proposed approach through empirical evaluation and comparison to FHE schemes on real-world MP3 files.

CRDec 18, 2024
Nemesis: Noise-randomized Encryption with Modular Efficiency and Secure Integration in Machine Learning Systems

Dongfang Zhao

Machine learning (ML) systems that guarantee security and privacy often rely on Fully Homomorphic Encryption (FHE) as a cornerstone technique, enabling computations on encrypted data without exposing sensitive information. However, a critical limitation of FHE is its computational inefficiency, making it impractical for large-scale applications. In this work, we propose \textit{Nemesis}, a framework that accelerates FHE-based systems without compromising accuracy or security. The design of Nemesis is inspired by Rache (SIGMOD'23), which introduced a caching mechanism for encrypted integers and scalars. Nemesis extends this idea with more advanced caching techniques and mathematical tools, enabling efficient operations over multi-slot FHE schemes and overcoming Rache's limitations to support general plaintext structures. We formally prove the security of Nemesis under standard cryptographic assumptions and evaluate its performance extensively on widely used datasets, including MNIST, FashionMNIST, and CIFAR-10. Experimental results show that Nemesis significantly reduces the computational overhead of FHE-based ML systems, paving the way for broader adoption of privacy-preserving technologies.

LGDec 14, 2024
Ares: Approximate Representations via Efficient Sparsification -- A Stateless Approach through Polynomial Homomorphism

Dongfang Zhao

The increasing prevalence of high-dimensional data demands efficient and scalable compression methods to support modern applications. However, existing techniques like PCA and Autoencoders often rely on auxiliary metadata or intricate architectures, limiting their practicality for streaming or infinite datasets. In this paper, we introduce a stateless compression framework that leverages polynomial representations to achieve compact, interpretable, and scalable data reduction. By eliminating the need for auxiliary data, our method supports direct algebraic operations in the compressed domain while minimizing error growth during computations. Through extensive experiments on synthetic and real-world datasets, we show that our approach achieves high compression ratios without compromising reconstruction accuracy, all while maintaining simplicity and scalability.

CRDec 26, 2023
Smuche: Scalar-Multiplicative Caching in Homomorphic Encryption

Dongfang Zhao

Addressing the challenge of balancing security and efficiency when deploying machine learning systems in untrusted environments, such as federated learning, remains a critical concern. A promising strategy to tackle this issue involves optimizing the performance of fully homomorphic encryption (HE). Recent research highlights the efficacy of advanced caching techniques, such as Rache, in significantly enhancing the performance of HE schemes without compromising security. However, Rache is constrained by an inherent limitation: its performance overhead is heavily influenced by the characteristics of plaintext models, specifically exhibiting a caching time complexity of $\mathcal{O}(N)$, where $N$ represents the number of cached pivots based on specific radixes. This caching overhead becomes impractical for handling large-scale data. In this study, we introduce a novel \textit{constant-time} caching technique that is independent of any parameters. The core concept involves applying scalar multiplication to a single cached ciphertext, followed by the introduction of a completely new and constant-time randomness. Leveraging the inherent characteristics of constant-time construction, we coin the term ``Smuche'' for this innovative caching technique, which stands for Scalar-multiplicative Caching of Homomorphic Encryption. We implemented Smuche from scratch and conducted comparative evaluations against two baseline schemes, Rache and CKKS. Our experimental results underscore the effectiveness of Smuche in addressing the identified limitations and optimizing the performance of homomorphic encryption in practical scenarios.

CRJan 12, 2022
Rache: Radix-additive caching for homomorphic encryption

Dongfang Zhao

One of the biggest concerns for many applications in cloud computing lies in data privacy. A potential solution to this problem is homomorphic encryption (HE), which supports certain operations directly over the ciphertexts. Conventional HE schemes, however, exhibit significant performance overhead and are hardly applicable to real-world applications. This paper presents Rache, a caching optimization for accelerating the performance of HE schemes. The key insights of Rache include (i) caching some homomorphic ciphertexts before encrypting the large volume of plaintexts; (ii) expanding the plaintexts into a summation of powers of radixes; and (iii) constructing the ciphertexts with only homomorphic addition. The extensive evaluation shows that Rache exhibits almost linear scalability and outperforms Paillier by orders of magnitude.

LGNov 27, 2021
Nonparametric Topological Layers in Neural Networks

Dongfang Zhao

Various topological techniques and tools have been applied to neural networks in terms of network complexity, explainability, and performance. One fundamental assumption of this line of research is the existence of a global (Euclidean) coordinate system upon which the topological layer is constructed. Despite promising results, such a \textit{topologization} method has yet to be widely adopted because the parametrization of a topologization layer takes a considerable amount of time and more importantly, lacks a theoretical foundation without which the performance of the neural network only achieves suboptimal performance. This paper proposes a learnable topological layer for neural networks without requiring a Euclidean space; Instead, the proposed construction requires nothing more than a general metric space except for an inner product, i.e., a Hilbert space. Accordingly, the according parametrization for the proposed topological layer is free of user-specified hyperparameters, which precludes the costly parametrization stage and the corresponding possibility of suboptimal networks.

CRNov 19, 2021
INCHE: High-Performance Encoding for Relational Databases through Incrementally Homomorphic Encryption

Dongfang Zhao

Homomorphic encryption (HE) offers data confidentiality by executing queries directly on encrypted fields in the database-as-a-service (DaaS) paradigm. While fully HE exhibits great expressiveness but prohibitive performance overhead, a better balance between flexibility and efficiency can be achieved by partially HE schemes. Performance-wise, however, the encryption rate of state-of-the-art HE schemes is still orders of magnitude lower than the I/O throughput, rendering the HE scheme the performance bottleneck. This paper proposes INCHE, an incrementally homomorphic encryption scheme, which aims to boost the performance of HE schemes by incrementally encrypting fields in relational databases. The key idea of INCHE is to explore the intrinsic correlation between plaintexts and cache them for future reuse such that expensive HE primitives from plaintexts to ciphertexts are avoided. We prove the semantic security of INCHE under the chosen-plaintext attack (CPA) model and show that its time complexity is linear in the plaintext length. We implement an INCHE prototype by extending the Symmetria cryptosystem and verify its effectiveness on both randomly-generated data and the TPC-H benchmark.

DCAug 19, 2020
An Algebraic-Topological Approach to Processing Cross-Blockchain Transactions

Dongfang Zhao

The state-of-the-art techniques for processing cross-blockchain transactions take a simple centralized approach: when the assets on blockchain $X$, say $X$-coins, are exchanged with the assets on blockchain $Y$---the $Y$-coins, those $X$-coins need to be exchanged to a "middle" medium (such as Bitcoin) that is then exchanged to $Y$-coins. If there are more than two parties involved in a single global transaction, the global transaction is split into multiple local two-party transactions, each of which follows the above central-exchange protocol. Unfortunately, the atomicity of the global transaction is violated with the central-exchange approach: those local two-party transactions, once committed, cannot be rolled back if the global transaction decides to abort. In a more general sense, the graph-based model of (two-party) transactions can hardly be extended to an arbitrary number of parties in a cross-blockchain transaction. %from why to how In this paper, we introduce a higher-level abstraction of cross-blockchain transactions. We adopt the \textit{abstract simplicial complex}, an extensively-studied mathematical object in algebraic topology, to represent an arbitrary number of parties involved in the blockchain transactions. Essentially, each party in the global transaction is modeled as a vertex and the global transaction among $n+1$ ($n \in \mathbb{Z}$, $n > 0$) parties compose a $n$-dimensional simplex. While this higher-level abstraction seems plausibly trivial, we will show how this simple extension leads to a new line of modeling methods and protocols for better processing cross-blockchain transactions.

CRMay 5, 2020
Toward Equilibria and Solvability of Blockchain Pooling Strategies: A Topological Approach

Dongfang Zhao

In 2015, Eyal proposed the first game-theoretical model for analyzing the equilibrium of blockchain pooling: when the blockchain pools are abstracted as a non-cooperative game, two pools can reach a Nash equilibrium with a closed-form formula; Moreover, an arbitrary number of pools still exhibit an equilibrium as long as the pools have an equal number of miners. Nevertheless, whether an equilibrium exists for three or more pools of distinct sizes remains an open problem. To this end, this paper studies the equilibrium in a blockchain of arbitrary pools. First, we show that the equilibrium among $q$ identical pools, coinciding the result demonstrated by Eyal through game theory, can be constructed using a topological approach. Second, if the pools are of different size, we show that (i) if the blockchain's pools exhibit two distinct sizes, an equilibrium can be reached, and (ii) if the blockchain has at least three distinct pool sizes, there does not exist an equilibrium.

CRApr 1, 2020
Topological Properties of Multi-Party Blockchain Transactions

Dongfang Zhao

The cross-blockchain transaction remains one of the most challenging problems in blockchains. The root cause of the challenge lies in the nondeterministic nature of blockchains: A $n$-party transaction across multiple blockchains might be partially rolled back due to the potential forks in any of the participating blockchains---eventually, only one fork will survive in the competition among miners. While some effort has recently been made to developing hierarchically distributed commit protocols to make multi-party transactions progress, there is no systematic method to reason about the transaction outcome. This paper tackles this problem from a perspective of point-set topology. We construct multiple topological spaces for the transactions and blockchain forks, and show that these spaces are internally related through either homeomorphism or continuous functions. Combined together, these tools allow us to reason about the cross-blockchain transactions through the growing-fork topology, an intuitive representation of blockchains.

GTJan 12, 2020
Permissioned Blockchain Revisited: A Byzantine Game-Theoretical Perspective

Dongfang Zhao

Despite the popularity and practical applicability of blockchains, there is very limited work on the theoretical foundation of blockchains: The lack of rigorous theory and analysis behind the curtain of blockchains has severely staggered its broader applications. This paper attempts to lay out a theoretical foundation for a specific type of blockchains---the ones requiring basic authenticity from the participants, also called \textit{permissioned blockchain}. We formulate permissioned blockchain systems and operations into a game-theoretical problem by incorporating constraints implied by the wisdom from distributed computing and Byzantine systems. We show that in a noncooperative blockchain game (NBG), a Nash equilibrium can be efficiently found in a closed-form even though the game involves more than two players. Somewhat surprisingly, the simulation results of the Nash equilibrium implies that the game can reach a stable status regardless of the number of Byzantine nodes and trustworthy players. We then study a harder problem where players are allowed to form coalitions: the coalitional blockchain game (CBG). We show that although the Shapley value for a CBG can be expressed in a more succinct form, its core is empty.

AIDec 20, 2019
The Blockchain Game: Synthesis of Byzantine Systems and Nash Equilibria

Dongfang Zhao

This position paper presents a synthesis viewpoint of blockchains from two orthogonal perspectives: fault-tolerant distributed systems and game theory. Specifically, we formulate a new game-theoretical problem in the context of blockchains and sketch a closed-form Nash equilibrium to the problem.