CRJan 22Code
NOIR: Privacy-Preserving Generation of Code with Open-Source LLMsKhoa Nguyen, Khiem Ton, NhatHai Phan et al.
Although boosting software development performance, large language model (LLM)-powered code generation introduces intellectual property and data security risks rooted in the fact that a service provider (cloud) observes a client's prompts and generated code, which can be proprietary in commercial systems. To mitigate this problem, we propose NOIR, the first framework to protect the client's prompts and generated code from the cloud. NOIR uses an encoder and a decoder at the client to encode and send the prompts' embeddings to the cloud to get enriched embeddings from the LLM, which are then decoded to generate the code locally at the client. Since the cloud can use the embeddings to infer the prompt and the generated code, NOIR introduces a new mechanism to achieve indistinguishability, a local differential privacy protection at the token embedding level, in the vocabulary used in the prompts and code, and a data-independent and randomized tokenizer on the client side. These components effectively defend against reconstruction and frequency analysis attacks by an honest-but-curious cloud. Extensive analysis and results using open-source LLMs show that NOIR significantly outperforms existing baselines on benchmarks, including the Evalplus (MBPP and HumanEval, Pass@1 of 76.7 and 77.4), and BigCodeBench (Pass@1 of 38.7, only a 1.77% drop from the original LLM) under strong privacy against attacks.
LGAug 10, 2024Code
SAMSA: Efficient Transformer for Many Data ModalitiesMinh Lenhat, Viet Anh Nguyen, Khoa Nguyen et al.
The versatility of self-attention mechanism earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. Efficient transformers, on the other hand, often rely on clever data-modality-dependent construction to get over the quadratic complexity of transformers. This greatly hinders their applications on different data modalities, which is one of the pillars of contemporary foundational modeling. In this paper, we lay the groundwork for efficient foundational modeling by proposing SAMSA - SAMpling-Self-Attention, a context-aware linear complexity self-attention mechanism that works well on multiple data modalities. Our mechanism is based on a differentiable sampling without replacement method we discovered. This enables the self-attention module to attend to the most important token set, where the importance is defined by data. Moreover, as differentiability is not needed in inference, the sparse formulation of our method costs little time overhead, further lowering computational costs. In short, SAMSA achieved competitive or even SOTA results on many benchmarks, while being faster in inference, compared to other very specialized models. Against full self-attention, real inference time significantly decreases while performance ranges from negligible degradation to outperformance. We release our source code in the repository: https://github.com/HySonLab/SAMSA
CRJan 20, 2023
Split Ways: Privacy-Preserving Training of Encrypted Data Using Split LearningTanveer Khan, Khoa Nguyen, Antonis Michalas
Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
CRSep 19, 2023
Love or Hate? Share or Split? Privacy-Preserving Training Using Split Learning and Homomorphic EncryptionTanveer Khan, Khoa Nguyen, Antonis Michalas et al.
Split learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies homomorphic encryption on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.
CRAug 30, 2023
Split Without a Leak: Reducing Privacy Leakage in Split LearningKhoa Nguyen, Tanveer Khan, Antonis Michalas
The popularity of Deep Learning (DL) makes the privacy of sensitive data more imperative than ever. As a result, various privacy-preserving techniques have been implemented to preserve user data privacy in DL. Among various privacy-preserving techniques, collaborative learning techniques, such as Split Learning (SL) have been utilized to accelerate the learning and prediction process. Initially, SL was considered a promising approach to data privacy. However, subsequent research has demonstrated that SL is susceptible to many types of attacks and, therefore, it cannot serve as a privacy-preserving technique. Meanwhile, countermeasures using a combination of SL and encryption have also been introduced to achieve privacy-preserving deep learning. In this work, we propose a hybrid approach using SL and Homomorphic Encryption (HE). The idea behind it is that the client encrypts the activation map (the output of the split layer between the client and the server) before sending it to the server. Hence, during both forward and backward propagation, the server cannot reconstruct the client's input data from the intermediate activation map. This improvement is important as it reduces privacy leakage compared to other SL-based works, where the server can gain valuable information about the client's input. In addition, on the MIT-BIH dataset, our proposed hybrid approach using SL and HE yields faster training time (about 6 times) and significantly reduced communication overhead (almost 160 times) compared to other HE-based approaches, thereby offering improved privacy protection for sensitive data in DL.
86.9INS-DETMar 27
Material Identification using Multi-Modal Intrinsic Radiation and RadiographyKhoa Nguyen, Brendt Wohlberg, Oleg Korobkin et al.
We investigate multi-modal material identification for special nuclear material (SNM) configurations using a combination of X-ray radiography, high-resolution γ-ray spectroscopy, and neutron multiplicity measurements. We consider a Beryllium Reflected Plutonium sphere (BeRP) ball surrounded by one or two concentric shielding shells of unknown composition whose radii are assumed known from radiography. High-purity germanium (HPGe) spectra are reduced to net counts in selected Pu-239 photo-peaks, while neutron multiplicity information is summarized by Feynman variances Y2 and Y3 computed from factorial moments of the neutron counting statistics. Using synthetic data generated with the Gamma Detector Response and Analysis Software (GADRAS) for a range of shielding materials and thicknesses, we cast the material identification problem as a supervised multi-class classification task over all admissible shell-material combinations. We demonstrate that a random forest classifier trained on combined gamma and neutron features achieves almost perfect identification accuracy for single-shell cases, and substantial performance gains for more challenging double-shell configurations relative to gamma-only classification. Alternative statistical and machine-learning formulations for this multi-class problem are examined along with examination of the impact of model-mismatch between the forward model and the test cases as given by variations in the statistical noise. Opportunities for extending the approach to more complex geometries and experimental data are also discussed.
83.7SEApr 20
Program Structure-aware Language Models: Targeted Software Testing beyond Textual SemanticsKhang Tran, Khoa Nguyen, Cristian Borcea et al.
Recent advances in large language models for test case generation have improved branch coverage via prompt-engineered mutations. However, they still lack principled mechanisms for steering models toward specific high-risk execution branches, limiting their effectiveness for discovering subtle bugs and security vulnerabilities. We propose GLMTest, the first program structure-aware LLM framework for targeted test case generation that seamlessly integrates code property graphs and code semantics using a graph neural network and a language model to condition test case generation on execution branches. This structured conditioning enables controllable and branch-targeted test case generation, thereby potentially enhancing bug and security risk discovery. Experiments on real-world projects show that GLMTest built on a Qwen2.5-Coder-7B-Instruct model improves branch accuracy from 27.4% to 50.2% on TestGenEval benchmark compared with state-of-the-art LLMs, i.e., Claude-Sonnet-4.5 and GPT-4o-mini.
LGAug 11, 2024
Sampling Foundational Transformer: A Theoretical PerspectiveViet Anh Nguyen, Minh Lenhat, Khoa Nguyen et al.
The versatility of self-attention mechanism earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. To apply transformers across different data modalities, practitioners have to make specific clever data-modality-dependent constructions. In this paper, we propose Sampling Foundational Transformer (SFT) that can work on multiple data modalities (e.g., point cloud, graph, and sequence) and constraints (e.g., rotational-invariant). The existence of such model is important as contemporary foundational modeling requires operability on multiple data sources. For efficiency on large number of tokens, our model relies on our context aware sampling-without-replacement mechanism for both linear asymptotic computational complexity and real inference time gain. For efficiency, we rely on our newly discovered pseudoconvex formulation of transformer layer to increase model's convergence rate. As a model working on multiple data modalities, SFT has achieved competitive results on many benchmarks, while being faster in inference, compared to other very specialized models.
CRMar 6, 2024Code
Wildest Dreams: Reproducible Research in Privacy-preserving Neural Network TrainingTanveer Khan, Mindaugas Budzys, Khoa Nguyen et al.
Machine Learning (ML), addresses a multitude of complex issues in multiple disciplines, including social sciences, finance, and medical research. ML models require substantial computing power and are only as powerful as the data utilized. Due to high computational cost of ML methods, data scientists frequently use Machine Learning-as-a-Service (MLaaS) to outsource computation to external servers. However, when working with private information, like financial data or health records, outsourcing the computation might result in privacy issues. Recent advances in Privacy-Preserving Techniques (PPTs) have enabled ML training and inference over protected data through the use of Privacy-Preserving Machine Learning (PPML). However, these techniques are still at a preliminary stage and their application in real-world situations is demanding. In order to comprehend discrepancy between theoretical research suggestions and actual applications, this work examines the past and present of PPML, focusing on Homomorphic Encryption (HE) and Secure Multi-party Computation (SMPC) applied to ML. This work primarily focuses on the ML model's training phase, where maintaining user data privacy is of utmost importance. We provide a solid theoretical background that eases the understanding of current approaches and their limitations. In addition, we present a SoK of the most recent PPML frameworks for model training and provide a comprehensive comparison in terms of the unique properties and performances on standard benchmarks. Also, we reproduce the results for some of the papers and examine at what level existing works in the field provide support for open science. We believe our work serves as a valuable contribution by raising awareness about the current gap between theoretical advancements and real-world applications in PPML, specifically regarding open-source availability, reproducibility, and usability.
32.8LGMar 21
Neural Autoregressive Flows for Markov Boundary LearningKhoa Nguyen, Bao Duong, Viet Huynh et al.
Recovering Markov boundary -- the minimal set of variables that maximizes predictive performance for a response variable -- is crucial in many applications. While recent advances improve upon traditional constraint-based techniques by scoring local causal structures, they still rely on nonparametric estimators and heuristic searches, lacking theoretical guarantees for reliability. This paper investigates a framework for efficient Markov boundary discovery by integrating conditional entropy from information theory as a scoring criterion. We design a novel masked autoregressive network to capture complex dependencies. A parallelizable greedy search strategy in polynomial time is proposed, supported by analytical evidence. We also discuss how initializing a graph with learned Markov boundaries accelerates the convergence of causal discovery. Comprehensive evaluations on real-world and synthetic datasets demonstrate the scalability and superior performance of our method in both Markov boundary discovery and causal discovery tasks.
43.2CVMay 8
Towards multi-modal forgery representation learning for AI-generated video detection and localizationDat Le, Khoa Nguyen, Xin Wang et al.
Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including partially manipulated clips across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality of data modeling and the lack of fine-grained temporal forgery localization. To address these challenges, our primary novelty introduces a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
ARJul 23, 2025
FedChip: Federated LLM for Artificial Intelligence Accelerator Chip DesignMahmoud Nazzal, Khoa Nguyen, Deepak Vungarala et al.
AI hardware design is advancing rapidly, driven by the promise of design automation to make chip development faster, more efficient, and more accessible to a wide range of users. Amongst automation tools, Large Language Models (LLMs) offer a promising solution by automating and streamlining parts of the design process. However, their potential is hindered by data privacy concerns and the lack of domain-specific training. To address this, we introduce FedChip, a Federated fine-tuning approach that enables multiple Chip design parties to collaboratively enhance a shared LLM dedicated for automated hardware design generation while protecting proprietary data. FedChip enables parties to train the model on proprietary local data and improve the shared LLM's performance. To exemplify FedChip's deployment, we create and release APTPU-Gen, a dataset of 30k design variations spanning various performance metric values such as power, performance, and area (PPA). To encourage the LLM to generate designs that achieve a balance across multiple quality metrics, we propose a new design evaluation metric, Chip@k, which statistically evaluates the quality of generated designs against predefined acceptance criteria. Experimental results show that FedChip improves design quality by more than 77% over high-end LLMs while maintaining data privacy
LGOct 27, 2025
SGFusion: Stochastic Geographic Gradient Fusion in Federated LearningKhoa Nguyen, Khang Tran, NhatHai Phan et al.
This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that "more similar" zones have "higher probabilities" of sharing gradients with "larger attention weights." SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
LGAug 11, 2025
Sparse Partial Optimal Transport via Quadratic RegularizationKhang Tran, Khoa Nguyen, Anh Nguyen et al.
Partial Optimal Transport (POT) has recently emerged as a central tool in various Machine Learning (ML) applications. It lifts the stringent assumption of the conventional Optimal Transport (OT) that input measures are of equal masses, which is often not guaranteed in real-world datasets, and thus offers greater flexibility by permitting transport between unbalanced input measures. Nevertheless, existing major solvers for POT commonly rely on entropic regularization for acceleration and thus return dense transport plans, hindering the adoption of POT in various applications that favor sparsity. In this paper, as an alternative approach to the entropic POT formulation in the literature, we propose a novel formulation of POT with quadratic regularization, hence termed quadratic regularized POT (QPOT), which induces sparsity to the transport plan and consequently facilitates the adoption of POT in many applications with sparsity requirements. Extensive experiments on synthetic and CIFAR-10 datasets, as well as real-world applications such as color transfer and domain adaptations, consistently demonstrate the improved sparsity and favorable performance of our proposed QPOT formulation.
CRJul 20, 2025
A Privacy-Centric Approach: Scalable and Secure Federated Learning Enabled by Hybrid Homomorphic EncryptionKhoa Nguyen, Tanveer Khan, Hossein Abdinasibfar et al.
Federated Learning (FL) enables collaborative model training without sharing raw data, making it a promising approach for privacy-sensitive domains. Despite its potential, FL faces significant challenges, particularly in terms of communication overhead and data privacy. Privacy-preserving Techniques (PPTs) such as Homomorphic Encryption (HE) have been used to mitigate these concerns. However, these techniques introduce substantial computational and communication costs, limiting their practical deployment. In this work, we explore how Hybrid Homomorphic Encryption (HHE), a cryptographic protocol that combines symmetric encryption with HE, can be effectively integrated with FL to address both communication and privacy challenges, paving the way for scalable and secure decentralized learning system.
LGMar 8, 2025
Clustering-based Meta Bayesian Optimization with Theoretical GuaranteeKhoa Nguyen, Viet Huynh, Binh Tran et al.
Bayesian Optimization (BO) is a well-established method for addressing black-box optimization problems. In many real-world scenarios, optimization often involves multiple functions, emphasizing the importance of leveraging data and learned functions from prior tasks to enhance efficiency in the current task. To expedite convergence to the global optimum, recent studies have introduced meta-learning strategies, collectively referred to as meta-BO, to incorporate knowledge from historical tasks. However, in practical settings, the underlying functions are often heterogeneous, which can adversely affect optimization performance for the current task. Additionally, when the number of historical tasks is large, meta-BO methods face significant scalability challenges. In this work, we propose a scalable and robust meta-BO method designed to address key challenges in heterogeneous and large-scale meta-tasks. Our approach (1) effectively partitions transferred meta-functions into highly homogeneous clusters, (2) learns the geometry-based surrogate prototype that capture the structural patterns within each cluster, and (3) adaptively synthesizes meta-priors during the online phase using statistical distance-based weighting policies. Experimental results on real-world hyperparameter optimization (HPO) tasks, combined with theoretical guarantees, demonstrate the robustness and effectiveness of our method in overcoming these challenges.
LGJan 26, 2024
GuardML: Efficient Privacy-Preserving Machine Learning Services Through Hybrid Homomorphic EncryptionEugene Frimpong, Khoa Nguyen, Mindaugas Budzys et al.
Machine Learning (ML) has emerged as one of data science's most transformative and influential domains. However, the widespread adoption of ML introduces privacy-related concerns owing to the increasing number of malicious attacks targeting ML models. To address these concerns, Privacy-Preserving Machine Learning (PPML) methods have been introduced to safeguard the privacy and security of ML models. One such approach is the use of Homomorphic Encryption (HE). However, the significant drawbacks and inefficiencies of traditional HE render it impractical for highly scalable scenarios. Fortunately, a modern cryptographic scheme, Hybrid Homomorphic Encryption (HHE), has recently emerged, combining the strengths of symmetric cryptography and HE to surmount these challenges. Our work seeks to introduce HHE to ML by designing a PPML scheme tailored for end devices. We leverage HHE as the fundamental building block to enable secure learning of classification outcomes over encrypted data, all while preserving the privacy of the input data and ML model. We demonstrate the real-world applicability of our construction by developing and evaluating an HHE-based PPML application for classifying heart disease based on sensitive ECG data. Notably, our evaluations revealed a slight reduction in accuracy compared to inference on plaintext data. Additionally, both the analyst and end devices experience minimal communication and computation costs, underscoring the practical viability of our approach. The successful integration of HHE into PPML provides a glimpse into a more secure and privacy-conscious future for machine learning on relatively constrained end devices.
ASJul 6, 2020
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio CaptioningKhoa Nguyen, Konstantinos Drossos, Tuomas Virtanen
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics.
CRJun 30, 2020
Traceable Policy-Based Signatures and Instantiation from LatticesYanhong Xu, Reihaneh Safavi-Naini, Khoa Nguyen et al.
Policy-based signatures (PBS) were proposed by Bellare and Fuchsbauer (PKC 2014) to allow an {\em authorized} member of an organization to sign a message on behalf of the organization. The user's authorization is determined by a policy managed by the organization's trusted authority, while the signature preserves the privacy of the organization's policy. Signing keys in PBS do not include user identity information and thus can be passed to others, violating the intention of employing PBS to restrict users' signing capability. In this paper, we introduce the notion of {\em traceability} for PBS by including user identity in the signing key such that the trusted authority will be able to open a suspicious signature and recover the signer's identity should the needs arise. We provide rigorous definitions and stringent security notions of traceable PBS (TPBS), capturing the properties of PBS suggested by Bellare-Fuchsbauer and resembling the "full traceability" requirement for group signatures put forward by Bellare-Micciancio-Warinschi (Eurocrypt 2003). As a proof of concept, we provide a modular construction of TPBS, based on a signature scheme, an encryption scheme and a zero-knowledge proof system. Furthermore, to demonstrate the feasibility of achieving TPBS from concrete, quantum-resistant assumptions, we give an instantiation based on lattices.
CRSep 10, 2019
Provably Secure Group Signature Schemes from Code-Based AssumptionsMartianus Frederic Ezerman, Hyung Tae Lee, San Ling et al.
We solve an open question in code-based cryptography by introducing two provably secure group signature schemes from code-based assumptions. Our basic scheme satisfies the CPA-anonymity and traceability requirements in the random oracle model, assuming the hardness of the McEliece problem, the Learning Parity with Noise problem, and a variant of the Syndrome Decoding problem. The construction produces smaller key and signature sizes than the previous group signature schemes from lattices, as long as the cardinality of the underlying group does not exceed $2^{24}$, which is roughly comparable to the current population of the Netherlands. We develop the basic scheme further to achieve the strongest anonymity notion, i.e., CCA-anonymity, with a small overhead in terms of efficiency. The feasibility of two proposed schemes is supported by implementation results. Our two schemes are the first in their respective classes of provably secure groups signature schemes. Additionally, the techniques introduced in this work might be of independent interest. These are a new verifiable encryption protocol for the randomized McEliece encryption and a novel approach to design formal security reductions from the Syndrome Decoding problem.
CRJan 2, 2019
Accountable Tracing Signatures from LatticesSan Ling, Khoa Nguyen, Huaxiong Wang et al.
Group signatures allow users of a group to sign messages anonymously in the name of the group, while incorporating a tracing mechanism to revoke anonymity and identify the signer of any message. Since its introduction by Chaum and van Heyst (EUROCRYPT 1991), numerous proposals have been put forward, yielding various improvements on security, efficiency and functionality. However, a drawback of traditional group signatures is that the opening authority is given too much power, i.e., he can indiscriminately revoke anonymity and there is no mechanism to keep him accountable. To overcome this problem, Kohlweiss and Miers (PoPET 2015) introduced the notion of accountable tracing signatures (ATS) - an enhanced group signature variant in which the opening authority is kept accountable for his actions. Kohlweiss and Miers demonstrated a generic construction of ATS and put forward a concrete instantiation based on number-theoretic assumptions. To the best of our knowledge, no other ATS scheme has been known, and the problem of instantiating ATS under post-quantum assumptions, e.g., lattices, remains open to date. In this work, we provide the first lattice-based accountable tracing signature scheme. The scheme satisfies the security requirements suggested by Kohlweiss and Miers, assuming the hardness of the Ring Short Integer Solution (RSIS) and the Ring Learning With Errors (RLWE) problems. At the heart of our construction are a lattice-based key-oblivious encryption scheme and a zero-knowledge argument system allowing to prove that a given ciphertext is a valid RLWE encryption under some hidden yet certified key. These technical building blocks may be of independent interest, e.g., they can be useful for the design of other lattice-based privacy-preserving protocols.
CRFeb 14, 2018
Zero-Knowledge Password Policy Check from LatticesKhoa Nguyen, Benjamin Hong Meng Tan, Huaxiong Wang
Passwords are ubiquitous and most commonly used to authenticate users when logging into online services. Using high entropy passwords is critical to prevent unauthorized access and password policies emerged to enforce this requirement on passwords. However, with current methods of password storage, poor practices and server breaches have leaked many passwords to the public. To protect one's sensitive information in case of such events, passwords should be hidden from servers. Verifier-based password authenticated key exchange, proposed by Bellovin and Merrit (IEEE S\&P, 1992), allows authenticated secure channels to be established with a hash of a password (verifier). Unfortunately, this restricts password policies as passwords cannot be checked from their verifier. To address this issue, Kiefer and Manulis (ESORICS 2014) proposed zero-knowledge password policy check (ZKPPC). A ZKPPC protocol allows users to prove in zero knowledge that a hash of the user's password satisfies the password policy required by the server. Unfortunately, their proposal is not quantum resistant with the use of discrete logarithm-based cryptographic tools and there are currently no other viable alternatives. In this work, we construct the first post-quantum ZKPPC using lattice-based tools. To this end, we introduce a new randomised password hashing scheme for ASCII-based passwords and design an accompanying zero-knowledge protocol for policy compliance. Interestingly, our proposal does not follow the framework established by Kiefer and Manulis and offers an alternate construction without homomorphic commitments. Although our protocol is not ready to be used in practice, we think it is an important first step towards a quantum-resistant privacy-preserving password-based authentication and key exchange system.
CRJan 26, 2018
Lattice-Based Group Signatures: Achieving Full Dynamicity (and Deniability) with EaseSan Ling, Khoa Nguyen, Huaxiong Wang et al.
In this work, we provide the first lattice-based group signature that offers full dynamicity (i.e., users have the flexibility in joining and leaving the group), and thus, resolve a prominent open problem posed by previous works. Moreover, we achieve this non-trivial feat in a relatively simple manner. Starting with Libert et al.'s fully static construction (Eurocrypt 2016) - which is arguably the most efficient lattice-based group signature to date, we introduce simple-but-insightful tweaks that allow to upgrade it directly into the fully dynamic setting. More startlingly, our scheme even produces slightly shorter signatures than the former, thanks to an adaptation of a technique proposed by Ling et al. (PKC 2013), allowing to prove inequalities in zero-knowledge. Our design approach consists of upgrading Libert et al.'s static construction (EUROCRYPT 2016) - which is arguably the most efficient lattice-based group signature to date - into the fully dynamic setting. Somewhat surprisingly, our scheme produces slightly shorter signatures than the former, thanks to a new technique for proving inequality in zero-knowledge without relying on any inequality check. The scheme satisfies the strong security requirements of Bootle et al.'s model (ACNS 2016), under the Short Integer Solution (SIS) and the Learning With Errors (LWE) assumptions. Furthermore, we demonstrate how to equip the obtained group signature scheme with the deniability functionality in a simple way. This attractive functionality, put forward by Ishida et al. (CANS 2016), enables the tracing authority to provide an evidence that a given user is not the owner of a signature in question. In the process, we design a zero-knowledge protocol for proving that a given LWE ciphertext does not decrypt to a particular message.
CRJan 25, 2018
Forward-Secure Group Signatures from LatticesSan Ling, Khoa Nguyen, Huaxiong Wang et al.
Group signature is a fundamental cryptographic primitive, aiming to protect anonymity and ensure accountability of users. It allows group members to anonymously sign messages on behalf of the whole group, while incorporating a tracing mechanism to identify the signer of any suspected signature. Most of the existing group signature schemes, however, do not guarantee security once secret keys are exposed. To reduce potential damages caused by key exposure attacks, Song (ACMCCS 2001) put forward the concept of forward-secure group signature (FSGS), which prevents attackers from forging group signatures pertaining to past time periods even if a secret group signing key is revealed at the current time period. For the time being, however, all known secure FSGS schemes are based on number-theoretic assumptions, and are vulnerable against quantum computers. In this work, we construct the first lattice-based FSGS scheme. Our scheme is proven secure under the Short Integer Solution and Learning With Errors assumptions. At the heart of our construction is a scalable lattice-based key evolving mechanism, allowing users to periodically update their secret keys and to efficiently prove in zero-knowledge that key evolution process is done correctly. To realize this essential building block, we first employ the Bonsai tree structure by Cash et al. (EUROCRYPT 2010) to handle the key evolution process, and then develop Langlois et al.'s construction (PKC 2014) to design its supporting zero-knowledge protocol.
CRJan 24, 2018
Server-Aided Revocable Predicate Encryption: Formalization and Lattice-Based InstantiationSan Ling, Khoa Nguyen, Huaxiong Wang et al.
Efficient user revocation is a necessary but challenging problem in many multi-user cryptosystems. Among known approaches, server-aided revocation yields a promising solution, because it allows to outsource the major workloads of system users to a computationally powerful third party, called the server, whose only requirement is to carry out the computations correctly. Such a revocation mechanism was considered in the settings of identity-based encryption and attribute-based encryption by Qin et al. (ESORICS 2015) and Cui et al. (ESORICS 2016), respectively. In this work, we consider the server-aided revocation mechanism in the more elaborate setting of predicate encryption (PE). The latter, introduced by Katz, Sahai, and Waters (EUROCRYPT 2008), provides fine-grained and role-based access to encrypted data and can be viewed as a generalization of identity-based and attribute-based encryption. Our contribution is two-fold. First, we formalize the model of server-aided revocable predicate encryption (SR-PE), with rigorous definitions and security notions. Our model can be seen as a non-trivial adaptation of Cui et al.'s work into the PE context. Second, we put forward a lattice-based instantiation of SR-PE. The scheme employs the PE scheme of Agrawal, Freeman and Vaikuntanathan (ASIACRYPT 2011) and the complete subtree method of Naor, Naor, and Lotspiech (CRYPTO 2001) as the two main ingredients, which work smoothly together thanks to a few additional techniques. Our scheme is proven secure in the standard model (in a selective manner), based on the hardness of the Learning With Errors (LWE) problem.