Masatoshi Yoshikawa

h-index26

32papers

894citations

Novelty56%

AI Score34

Ranked #131,621 of 205,806 authors (top 64%)#3,578 in CR (top 49%)

32 Papers

CLApr 27, 2022

BiTimeBERT: Extending Pre-Trained Language Representations with Bi-Temporal Information

Jiexin Wang, Adam Jatowt, Masatoshi Yoshikawa et al.

Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora, we use long-span temporal news article collection for building word representations. We introduce BiTimeBERT, a novel language representation model trained on a temporal collection of news articles via two new pre-training tasks, which harnesses two distinct temporal signals to construct time-aware language representations. The experimental results show that BiTimeBERT consistently outperforms BERT and other existing pre-trained models with substantial gains on different downstream NLP tasks and applications for which time is of importance (e.g., the accuracy improvement over BERT is 155\% on the event time estimation task).

LGAug 23, 2023

ULDP-FL: Federated Learning with Across Silo User-Level Differential Privacy

Fumiyuki Kato, Li Xiong, Shun Takagi et al.

Differentially Private Federated Learning (DP-FL) has garnered attention as a collaborative machine learning approach that ensures formal privacy. Most DP-FL approaches ensure DP at the record-level within each silo for cross-silo FL. However, a single user's data may extend across multiple silos, and the desired user-level DP guarantee for such a setting remains unknown. In this study, we present Uldp-FL, a novel FL framework designed to guarantee user-level DP in cross-silo FL where a single user's data may belong to multiple silos. Our proposed algorithm directly ensures user-level DP through per-user weighted clipping, departing from group-privacy approaches. We provide a theoretical analysis of the algorithm's privacy and utility. Additionally, we enhance the utility of the proposed algorithm with an enhanced weighting strategy based on user record distribution and design a novel private protocol that ensures no additional information is revealed to the silos and the server. Experiments on real-world datasets show substantial improvements in our methods in privacy-utility trade-offs under user-level DP compared to baseline methods. To the best of our knowledge, our work is the first FL framework that effectively provides user-level DP in the general cross-silo FL setting.

CVDec 20, 2022

Local Differential Privacy Image Generation Using Flow-based Deep Generative Models

Hisaichi Shibata, Shouhei Hanaoka, Yang Cao et al.

Diagnostic radiologists need artificial intelligence (AI) for medical imaging, but access to medical images required for training in AI has become increasingly restrictive. To release and use medical images, we need an algorithm that can simultaneously protect privacy and preserve pathologies in medical images. To develop such an algorithm, here, we propose DP-GLOW, a hybrid of a local differential privacy (LDP) algorithm and one of the flow-based deep generative models (GLOW). By applying a GLOW model, we disentangle the pixelwise correlation of images, which makes it difficult to protect privacy with straightforward LDP algorithms for images. Specifically, we map images onto the latent vector of the GLOW model, each element of which follows an independent normal distribution, and we apply the Laplace mechanism to the latent vector. Moreover, we applied DP-GLOW to chest X-ray images to generate LDP images while preserving pathologies.

CRApr 8, 2022

Network Shuffling: Privacy Amplification via Random Walks

Seng Pei Liew, Tsubasa Takahashi, Shun Takagi et al.

Recently, it is shown that shuffling can amplify the central differential privacy guarantees of data randomized with local differential privacy. Within this setup, a centralized, trusted shuffler is responsible for shuffling by keeping the identities of data anonymous, which subsequently leads to stronger privacy guarantees for systems. However, introducing a centralized entity to the originally local privacy model loses some appeals of not having any centralized entity as in local differential privacy. Moreover, implementing a shuffler in a reliable way is not trivial due to known security issues and/or requirements of advanced hardware or secure computation technology. Motivated by these practical considerations, we rethink the shuffle model to relax the assumption of requiring a centralized, trusted shuffler. We introduce network shuffling, a decentralized mechanism where users exchange data in a random-walk fashion on a network/graph, as an alternative of achieving privacy amplification via anonymity. We analyze the threat model under such a setting, and propose distributed protocols of network shuffling that is straightforward to implement in practice. Furthermore, we show that the privacy amplification rate is similar to other privacy amplification techniques such as uniform shuffling. To our best knowledge, among the recently studied intermediate trust models that leverage privacy amplification techniques, our work is the first that is not relying on any centralized entity to achieve privacy amplification.

CRMay 13, 2024

HRNet: Differentially Private Hierarchical and Multi-Resolution Network for Human Mobility Data Synthesization

Shun Takagi, Li Xiong, Fumiyuki Kato et al.

Human mobility data offers valuable insights for many applications such as urban planning and pandemic response, but its use also raises privacy concerns. In this paper, we introduce the Hierarchical and Multi-Resolution Network (HRNet), a novel deep generative model specifically designed to synthesize realistic human mobility data while guaranteeing differential privacy. We first identify the key difficulties inherent in learning human mobility data under differential privacy. In response to these challenges, HRNet integrates three components: a hierarchical location encoding mechanism, multi-task learning across multiple resolutions, and private pre-training. These elements collectively enhance the model's ability under the constraints of differential privacy. Through extensive comparative experiments utilizing a real-world dataset, HRNet demonstrates a marked improvement over existing methods in balancing the utility-privacy trade-off.

LGOct 21, 2024

Extracting Spatiotemporal Data from Gradients with Large Language Models

Lele Zheng, Yang Cao, Renhe Jiang et al.

Recent works show that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains, such as spatiotemporal data. To understand privacy risks in spatiotemporal federated learning, we first propose Spatiotemporal Gradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to spatiotemporal data that successfully reconstructs the original location from gradients. Furthermore, the absence of priors in attacks on spatiotemporal data has hindered the accurate reconstruction of real client data. To address this limitation, we propose ST-GIA+, which utilizes an auxiliary language model to guide the search for potential locations, thereby successfully reconstructing the original data from gradients. In addition, we design an adaptive defense strategy to mitigate gradient inversion attacks in spatiotemporal federated learning. By dynamically adjusting the perturbation levels, we can offer tailored protection for varying rounds of training data, thereby achieving a better trade-off between privacy and utility than current state-of-the-art methods. Through intensive experimental analysis on three real-world datasets, we reveal that the proposed defense strategy can well preserve the utility of spatiotemporal federated learning with effective security protection.

CRFeb 1, 2025

Data Overvaluation Attack and Truthful Data Valuation in Federated Learning

Shuyuan Zheng, Sudong Cai, Chuan Xiao et al.

In collaborative machine learning (CML), data valuation, i.e., evaluating the contribution of each client's data to the machine learning model, has become a critical task for incentivizing and selecting positive data contributions. However, existing studies often assume that clients engage in data valuation truthfully, overlooking the practical motivation for clients to exaggerate their contributions. To unlock this threat, this paper introduces the data overvaluation attack, enabling strategic clients to have their data significantly overvalued in federated learning, a widely adopted paradigm for decentralized CML. Furthermore, we propose a Bayesian truthful data valuation metric, named Truth-Shapley. Truth-Shapley is the unique metric that guarantees some promising axioms for data valuation while ensuring that clients' optimal strategy is to perform truthful data valuation under certain conditions. Our experiments demonstrate the vulnerability of existing data valuation metrics to the proposed attack and validate the robustness and effectiveness of Truth-Shapley.

LGFeb 15, 2022

OLIVE: Oblivious Federated Learning on Trusted Execution Environment against the risk of sparsification

Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Combining Federated Learning (FL) with a Trusted Execution Environment (TEE) is a promising approach for realizing privacy-preserving FL, which has garnered significant academic attention in recent years. Implementing the TEE on the server side enables each round of FL to proceed without exposing the client's gradient information to untrusted servers. This addresses usability gaps in existing secure aggregation schemes as well as utility gaps in differentially private FL. However, to address the issue using a TEE, the vulnerabilities of server-side TEEs need to be considered -- this has not been sufficiently investigated in the context of FL. The main technical contribution of this study is the analysis of the vulnerabilities of TEE in FL and the defense. First, we theoretically analyze the leakage of memory access patterns, revealing the risk of sparsified gradients, which are commonly used in FL to enhance communication efficiency and model accuracy. Second, we devise an inference attack to link memory access patterns to sensitive information in the training dataset. Finally, we propose an oblivious yet efficient aggregation algorithm to prevent memory access pattern leakage. Our experiments on real-world data demonstrate that the proposed method functions efficiently in practical scales.

CLSep 8, 2021

ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Historical News Collections

Jiexin Wang, Adam Jatowt, Masatoshi Yoshikawa

In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, the current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections such as long-term news archives spanning several decades, are rarely used in training the models despite they are quite valuable for our society. To foster the research in the field of ODQA on such historical collections, we present ArchivalQA, a large question answering dataset consisting of 532,444 question-answer pairs which is designed for temporal news QA. We divide our dataset into four subparts based on the question difficulty levels and the containment of temporal expressions, which we believe are useful for training and testing ODQA systems characterized by different strengths and abilities. The novel QA dataset-constructing framework that we introduce can be also applied to generate non-ambiguous questions of good quality over other types of temporal document collections.

CRJun 13, 2021

Understanding the Interplay between Privacy and Robustness in Federated Learning

Yaowei Han, Yang Cao, Masatoshi Yoshikawa

Federated Learning (FL) is emerging as a promising paradigm of privacy-preserving machine learning, which trains an algorithm across multiple clients without exchanging their data samples. Recent works highlighted several privacy and robustness weaknesses in FL and addressed these concerns using local differential privacy (LDP) and some well-studied methods used in conventional ML, separately. However, it is still not clear how LDP affects adversarial robustness in FL. To fill this gap, this work attempts to develop a comprehensive understanding of the effects of LDP on adversarial robustness in FL. Clarifying the interplay is significant since this is the first step towards a principled design of private and robust FL systems. We certify that local differential privacy has both positive and negative effects on adversarial robustness using theoretical analysis and empirical verification.

LGJun 8, 2021

FL-Market: Trading Private Models in Federated Learning

Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa et al.

The difficulty in acquiring a sufficient amount of training data is a major bottleneck for machine learning (ML) based data analytics. Recently, commoditizing ML models has been proposed as an economical and moderate solution to ML-oriented data acquisition. However, existing model marketplaces assume that the broker can access data owners' private training data, which may not be realistic in practice. In this paper, to promote trustworthy data acquisition for ML tasks, we propose FL-Market, a locally private model marketplace that protects privacy not only against model buyers but also against the untrusted broker. FL-Market decouples ML from the need to centrally gather training data on the broker's side using federated learning, an emerging privacy-preserving ML paradigm in which data owners collaboratively train an ML model by uploading local gradients (to be aggregated into a global gradient for model updating). Then, FL-Market enables data owners to locally perturb their gradients by local differential privacy and thus further prevents privacy risks. To drive FL-Market, we propose a deep learning-empowered auction mechanism for intelligently deciding the local gradients' perturbation levels and an optimal aggregation mechanism for aggregating the perturbed gradients. Our auction and aggregation mechanisms can jointly maximize the global gradient's accuracy, which optimizes model buyers' utility. Our experiments verify the effectiveness of the proposed mechanisms.

CRMay 4, 2021

Pricing Private Data with Personalized Differential Privacy and Partial Arbitrage Freeness

Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa

There is a growing trend regarding perceiving personal data as a commodity. Existing studies have built frameworks and theories about how to determine an arbitrage-free price of a given query according to the privacy loss quantified by differential privacy. However, those studies have assumed that data buyers can purchase query answers with the arbitrary privacy loss of data owners, which may not be valid under strict privacy regulations and data owners' increasing privacy concerns. In this paper, we study how to empower data owners to control privacy loss in data trading. First, we propose a framework for trading personal data that enables data owners to bound their personalized privacy losses. Second, since bounded privacy losses indicate bounded utilities of query answers, we propose a reasonable relaxation of arbitrage freeness named partial arbitrage freeness, i.e., the guarantee of arbitrage-free pricing only for a limited range of utilities, which provides more possibilities for our market design. Third, to avoid arbitrage, we propose a general method for ensuring arbitrage freeness under personalized differential privacy. Fourth, to fully utilize data owners' personalized privacy loss bounds, we propose privacy budget allocation techniques to allocate privacy losses for queries under arbitrage freeness. Finally, we conduct experiments to verify the effectiveness of our proposed trading protocols.

CRApr 14, 2021

Preventing Manipulation Attack in Local Differential Privacy using Verifiable Randomization Mechanism

Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Several randomization mechanisms for local differential privacy (LDP) (e.g., randomized response) are well-studied to improve the utility. However, recent studies show that LDP is generally vulnerable to malicious data providers in nature. Because a data collector has to estimate background data distribution only from already randomized data, malicious data providers can manipulate their output before sending, i.e., randomization would provide them plausible deniability. Attackers can skew the estimations effectively since they are calculated by normalizing with randomization probability defined in the LDP protocol, and can even control the estimations. In this paper, we show how we prevent malicious attackers from compromising LDP protocol. Our approach is to utilize a verifiable randomization mechanism. The data collector can verify the completeness of executing an agreed randomization mechanism for every data provider. Our proposed method completely protects the LDP protocol from output-manipulations, and significantly mitigates the expected damage from attacks. We do not assume any specific attacks, and it works effectively against general output-manipulation, and thus is more powerful than previously proposed countermeasures. We describe the secure version of three state-of-the-art LDP protocols and empirically show they cause acceptable overheads according to several parameters.

CRMar 1, 2021

Asymmetric Differential Privacy

Shun Takagi, Yang Cao, Masatoshi Yoshikawa

Differential privacy (DP) is getting attention as a privacy definition when publishing statistics of a dataset. This paper focuses on the limitation that DP inevitably causes two-sided error, which is not desirable for epidemic analysis such as how many COVID-19 infected individuals visited location A. For example, consider publishing misinformation that many infected people did not visit location A, which may lead to miss decision-making that expands the epidemic. To fix this issue, we propose a relaxation of DP, called asymmetric differential privacy (ADP). We show that ADP can provide reasonable privacy protection while achieving one-sided error. Finally, we conduct experiments to evaluate the utility of proposed mechanisms for epidemic analysis using a real-world dataset, which shows the practicality of our mechanisms.

CYDec 24, 2020

Quantifying the Privacy-Utility Trade-offs in COVID-19 Contact Tracing Apps

Patrick Ocheja, Yang Cao, Shiyao Ding et al.

How to contain the spread of the COVID-19 virus is a major concern for most countries. As the situation continues to change, various countries are making efforts to reopen their economies by lifting some restrictions and enforcing new measures to prevent the spread. In this work, we review some approaches that have been adopted to contain the COVID-19 virus such as contact tracing, clusters identification, movement restrictions, and status validation. Specifically, we classify available techniques based on some characteristics such as technology, architecture, trade-offs (privacy vs utility), and the phase of adoption. We present a novel approach for evaluating privacy using both qualitative and quantitative measures of privacy-utility assessment of contact tracing applications. In this new method, we classify utility at three (3) distinct levels: no privacy, 100% privacy, and at k where k is set by the system providing the utility or privacy.

CRDec 7, 2020

PCT-TEE: Trajectory-based Private Contact Tracing System with Trusted Execution Environment

Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

Existing Bluetooth-based Private Contact Tracing (PCT) systems can privately detect whether people have come into direct contact with COVID-19 patients. However, we find that the existing systems lack functionality and flexibility, which may hurt the success of the contact tracing. Specifically, they cannot detect indirect contact (e.g., people may be exposed to coronavirus because of used the same elevator even without direct contact); they also cannot flexibly change the rules of "risky contact", such as how many hours of exposure or how close to a COVID-19 patient that is considered as risk exposure, which may be changed with the environmental situation. In this paper, we propose an efficient and secure contact tracing system that enables both direct contact and indirect contact. To address the above problems, we need to utilize users' trajectory data for private contact tracing, which we call trajectory-based PCT. We formalize this problem as Spatiotemporal Private Set Intersection. By analyzing different approaches such as homomorphic encryption that could be extended to solve this problem, we identify that Trusted Execution Environment (TEE) is a proposing method to achieve our requirements. The major challenge is how to design algorithms for spatiotemporal private set intersection under limited secure memory of TEE. To this end, we design a TEE-based system with flexible trajectory data encoding algorithms. Our experiments on real-world data show that the proposed system can process thousands of queries on tens of million records of trajectory data in a few seconds.

CROct 26, 2020

Geo-Graph-Indistinguishability: Location Privacy on Road Networks Based on Differential Privacy

Shun Takagi, Yang Cao, Yasuhito Asano et al.

In recent years, concerns about location privacy are increasing with the spread of location-based services (LBSs). Many methods to protect location privacy have been proposed in the past decades. Especially, perturbation methods based on Geo-Indistinguishability (Geo-I), which randomly perturb a true location to a pseudolocation, are getting attention due to its strong privacy guarantee inherited from differential privacy. However, Geo-I is based on the Euclidean plane even though many LBSs are based on road networks (e.g. ride-sharing services). This causes unnecessary noise and thus an insufficient tradeoff between utility and privacy for LBSs on road networks. To address this issue, we propose a new privacy notion, Geo-Graph-Indistinguishability (GG-I), for locations on a road network to achieve a better tradeoff. We propose Graph-Exponential Mechanism (GEM), which satisfies GG-I. Moreover, we formalize the optimization problem to find the optimal GEM in terms of the tradeoff. However, the computational complexity of a naive method to find the optimal solution is prohibitive, so we propose a greedy algorithm to find an approximate solution in an acceptable amount of time. Finally, our experiments show that our proposed mechanism outperforms a Geo-I's mechanism with respect to the tradeoff.

CROct 26, 2020

Secure and Efficient Trajectory-Based Contact Tracing using Trusted Hardware

Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa

The COVID-19 pandemic has prompted technological measures to control the spread of the disease. Private contact tracing (PCT) is one of the promising techniques for the purpose. However, the recently proposed Bluetooth-based PCT has several limitations in terms of functionality and flexibility. The existing systems are only able to detect direct contact (i.e., human-human contact), but cannot detect indirect contact (i.e., human-object, such as the disease transmission through surface). Moreover, the rule of risky contact cannot be flexibly changed with the environmental situation and the nature of the virus. In this paper, we propose a secure and efficient trajectory-based PCT system using trusted hardware. We formalize trajectory-based PCT as a generalization of the well-studied Private Set Intersection (PSI), which is mostly based on cryptographic primitives and thus insufficient. We solve the problem by leveraging trusted hardware such as Intel SGX and designing a novel algorithm to achieve a secure, efficient and flexible PCT system. Our experiments on real-world data show that the proposed system can achieve high performance and scalability. Specifically, our system (one single machine with Intel SGX) can process thousands of queries on 100 million records of trajectory data in a few seconds.

LGSep 17, 2020

FLAME: Differentially Private Federated Learning in the Shuffle Model

Ruixuan Liu, Yang Cao, Hong Chen et al.

Federated Learning (FL) is a promising machine learning paradigm that enables the analyzer to train a model without collecting users' raw data. To ensure users' privacy, differentially private federated learning has been intensively studied. The existing works are mainly based on the \textit{curator model} or \textit{local model} of differential privacy. However, both of them have pros and cons. The curator model allows greater accuracy but requires a trusted analyzer. In the local model where users randomize local data before sending them to the analyzer, a trusted analyzer is not required but the accuracy is limited. In this work, by leveraging the \textit{privacy amplification} effect in the recently proposed shuffle model of differential privacy, we achieve the best of two worlds, i.e., accuracy in the curator model and strong privacy without relying on any trusted party. We first propose an FL framework in the shuffle model and a simple protocol (SS-Simple) extended from existing work. We find that SS-Simple only provides an insufficient privacy amplification effect in FL since the dimension of the model parameter is quite large. To solve this challenge, we propose an enhanced protocol (SS-Double) to increase the privacy amplification effect by subsampling. Furthermore, for boosting the utility when the model size is greater than the user population, we propose an advanced protocol (SS-Topk) with gradient sparsification techniques. We also provide theoretical analysis and numerical evaluations of the privacy amplification of the proposed protocols. Experiments on real-world dataset validate that SS-Topk improves the testing accuracy by 60.7\% than the local model based FL.

LGJun 22, 2020

P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model

Shun Takagi, Tsubasa Takahashi, Yang Cao et al.

How can we release a massive volume of sensitive data while mitigating privacy risks? Privacy-preserving data synthesis enables the data holder to outsource analytical tasks to an untrusted third party. The state-of-the-art approach for this problem is to build a generative model under differential privacy, which offers a rigorous privacy guarantee. However, the existing method cannot adequately handle high dimensional data. In particular, when the input dataset contains a large number of features, the existing techniques require injecting a prohibitive amount of noise to satisfy differential privacy, which results in the outsourced data analysis meaningless. To address the above issue, this paper proposes privacy-preserving phased generative model (P3GM), which is a differentially private generative model for releasing such sensitive data. P3GM employs the two-phase learning process to make it robust against the noise, and to increase learning efficiency (e.g., easy to converge). We give theoretical analyses about the learning complexity and privacy loss in P3GM. We further experimentally evaluate our proposed method and demonstrate that P3GM significantly outperforms existing solutions. Compared with the state-of-the-art methods, our generated samples look fewer noises and closer to the original data in terms of data diversity. Besides, in several data mining tasks with synthesized data, our model outperforms the competitors in terms of accuracy.

CRMay 4, 2020

PGLP: Customizable and Rigorous Location Privacy through Policy Graph

Yang Cao, Yonghui Xiao, Shun Takagi et al.

Location privacy has been extensively studied in the literature. However, existing location privacy models are either not rigorous or not customizable, which limits the trade-off between privacy and utility in many real-world applications. To address this issue, we propose a new location privacy notion called PGLP, i.e., \textit{Policy Graph based Location Privacy}, providing a rich interface to release private locations with customizable and rigorous privacy guarantee. First, we design the privacy metrics of PGLP by extending differential privacy. Specifically, we formalize a user's location privacy requirements using a \textit{location policy graph}, which is expressive and customizable. Second, we investigate how to satisfy an arbitrarily given location policy graph under adversarial knowledge. We find that a location policy graph may not always be viable and may suffer \textit{location exposure} when the attacker knows the user's mobility pattern. We propose efficient methods to detect location exposure and repair the policy graph with optimal utility. Third, we design a private location trace release framework that pipelines the detection of location exposure, policy graph repair, and private trajectory release with customizable and rigorous location privacy. Finally, we conduct experiments on real-world datasets to verify the effectiveness of the privacy-utility trade-off and the efficiency of the proposed algorithms.

DBMay 1, 2020

PANDA: Policy-aware Location Privacy for Epidemic Surveillance

Yang Cao, Shun Takagi, Yonghui Xiao et al.

In this demonstration, we present a privacy-preserving epidemic surveillance system. Recently, many countries that suffer from coronavirus crises attempt to access citizen's location data to eliminate the outbreak. However, it raises privacy concerns and may open the doors to more invasive forms of surveillance in the name of public health. It also brings a challenge for privacy protection techniques: how can we leverage people's mobile data to help combat the pandemic without scarifying our location privacy. We demonstrate that we can have the best of the two worlds by implementing policy-based location privacy for epidemic surveillance. Specifically, we formalize the privacy policy using graphs in light of differential privacy, called policy graph. Our system has three primary functions for epidemic surveillance: location monitoring, epidemic analysis, and contact tracing. We provide an interactive tool allowing the attendees to explore and examine the usability of our system: (1) the utility of location monitor and disease transmission model estimation, (2) the procedure of contact tracing in our systems, and (3) the privacy-utility trade-offs w.r.t. different policy graphs. The attendees can find that it is possible to have the full functionality of epidemic surveillance while preserving location privacy.

CRApr 16, 2020

Voice-Indistinguishability: Protecting Voiceprint in Privacy-Preserving Speech Data Release

Yaowei Han, Sheng Li, Yang Cao et al.

With the development of smart devices, such as the Amazon Echo and Apple's HomePod, speech data have become a new dimension of big data. However, privacy and security concerns may hinder the collection and sharing of real-world speech data, which contain the speaker's identifiable information, i.e., voiceprint, which is considered a type of biometric identifier. Current studies on voiceprint privacy protection do not provide either a meaningful privacy-utility trade-off or a formal and rigorous definition of privacy. In this study, we design a novel and rigorous privacy metric for voiceprint privacy, which is referred to as voice-indistinguishability, by extending differential privacy. We also propose mechanisms and frameworks for privacy-preserving speech data release satisfying voice-indistinguishability. Experiments on public datasets verify the effectiveness and efficiency of the proposed methods.

LGMar 24, 2020

FedSel: Federated SGD under Local Differential Privacy with Top-k Dimension Selection

Ruixuan Liu, Yang Cao, Masatoshi Yoshikawa et al.

As massive data are produced from small gadgets, federated learning on mobile devices has become an emerging trend. In the federated setting, Stochastic Gradient Descent (SGD) has been widely used in federated learning for various machine learning models. To prevent privacy leakages from gradients that are calculated on users' sensitive data, local differential privacy (LDP) has been considered as a privacy guarantee in federated SGD recently. However, the existing solutions have a dimension dependency problem: the injected noise is substantially proportional to the dimension $d$. In this work, we propose a two-stage framework FedSel for federated SGD under LDP to relieve this problem. Our key idea is that not all dimensions are equally important so that we privately select Top-k dimensions according to their contributions in each iteration of federated SGD. Specifically, we propose three private dimension selection mechanisms and adapt the gradient accumulation technique to stabilize the learning process with noisy updates. We also theoretically analyze privacy, accuracy and time complexity of FedSel, which outperforms the state-of-the-art solutions. Experiments on real-world and synthetic datasets verify the effectiveness and efficiency of our framework.

DBJul 25, 2019

Protecting Spatiotemporal Event Privacy in Continuous Location-Based Services

Yang Cao, Yonghui Xiao, Li Xiong et al.

Location privacy-preserving mechanisms (LPPMs) have been extensively studied for protecting users' location privacy by releasing a perturbed location to third parties such as location-based service providers. However, when a user's perturbed locations are released continuously, existing LPPMs may not protect the sensitive information about the user's spatiotemporal activities, such as "visited hospital in the last week" or "regularly commuting between Address 1 and Address 2" (it is easy to infer that Addresses 1 and 2 may be home and office), which we call it \textit{spatiotemporal event}. In this paper, we first formally define {spatiotemporal event} as Boolean expressions between location and time predicates, and then we define $ ε$-\textit{spatiotemporal event privacy} by extending the notion of differential privacy. Second, to understand how much spatiotemporal event privacy that existing LPPMs can provide, we design computationally efficient algorithms to quantify the privacy leakage of state-of-the-art LPPMs when an adversary has prior knowledge of the user's initial probability over possible locations. It turns out that the existing LPPMs cannot adequately protect spatiotemporal event privacy. Third, we propose a framework, PriSTE, to transform an existing LPPM into one protecting spatiotemporal event privacy against adversaries with \textit{any} prior knowledge. Our experiments on real-life and synthetic data verified that the proposed method is effective and efficient.

CRJun 13, 2019

Trading Location Data with Bounded Personalized Privacy Loss

Shuyuan Zheng, Yang Cao, Masatoshi Yoshikawa

As personal data have been the new oil of the digital era, there is a growing trend perceiving personal data as a commodity. Although some people are willing to trade their personal data for money, they might still expect limited privacy loss, and the maximum tolerable privacy loss varies with each individual. In this paper, we propose a framework that enables individuals to trade their personal data with bounded personalized privacy loss, which raises technical challenges in the aspects of budget allocation and arbitrage-freeness. To deal with those challenges,we propose two arbitrage-free trading mechanisms with different advantages.

SEApr 24, 2019

Blockchain-based Bidirectional Updates on Fine-grained Medical Data

Chunmiao Li, Yang Cao, Zhenjiang Hu et al.

Electronic medical data sharing between stakeholders, such as patients, doctors, and researchers, can promote more effective medical treatment collaboratively. These sensitive and private data should only be accessed by authorized users. Given a total medical data, users may care about parts of them and other unrelated information might interfere with the user interested data search and increase the risk of exposure. Besides accessing these data, users may want to update them and propagate to other sharing peers so that all peers keep identical data after each update. To satisfy these requirements, in this paper we propose a medical data sharing architecture that addresses the permission control using smart contracts on the blockchain and splits data into fined grained pieces shared with different peers then synchronize full data and these pieces with bidirectional transformations. Medical data reside on each userś local database and permission related data are stored on smart contracts. Only all peers have gained the newest shared data after updates can they start to do next operations on it, which are enforced by smart contracts. Blockchain based immutable shared ledge enables users to trace data updates history. This paper can provide a new perspective to view full medical data as different slices to be shared with various peers but consistency after updates between them are still promised, which can protect the privacy and improve data search efficiency.

DBApr 24, 2019

When and where do you want to hide? Recommendation of location privacy preferences with local differential privacy

Maho Asada, Masatoshi Yoshikawa, Yang Cao

In recent years, it has become easy to obtain location information quite precisely. However, the acquisition of such information has risks such as individual identification and leakage of sensitive information, so it is necessary to protect the privacy of location information. For this purpose, people should know their location privacy preferences, that is, whether or not he/she can release location information at each place and time. However, it is not easy for each user to make such decisions and it is troublesome to set the privacy preference at each time. Therefore, we propose a method to recommend location privacy preferences for decision making. Comparing to existing method, our method can improve the accuracy of recommendation by using matrix factorization and preserve privacy strictly by local differential privacy, whereas the existing method does not achieve formal privacy guarantee. In addition, we found the best granularity of a location privacy preference, that is, how to express the information in location privacy protection. To evaluate and verify the utility of our method, we have integrated two existing datasets to create a rich information in term of user number. From the results of the evaluation using this dataset, we confirmed that our method can predict location privacy preferences accurately and that it provides a suitable method to define the location privacy preference.

CVApr 23, 2018

Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training

Bei Liu, Jianlong Fu, Makoto P. Kato et al.

Automatic generation of natural language from images has attracted extensive attention. In this paper, we take one step further to investigate generation of poetic language (with multiple lines) to an image for automatic poetry creation. This task involves multiple challenges, including discovering poetic clues from the image (e.g., hope from green), and generating poems to satisfy both relevance to the image and poeticness in language level. To solve the above challenges, we formulate the task of poem generation into two correlated sub-tasks by multi-adversarial training via policy gradient, through which the cross-modal relevance and poetic language style can be ensured. To extract poetic clues from images, we propose to learn a deep coupled visual-poetic embedding, in which the poetic representation from objects, sentiments and scenes in an image can be jointly learned. Two discriminative networks are further introduced to guide the poem generation, including a multi-modal discriminator and a poem-style discriminator. To facilitate the research, we have released two poem datasets by human annotators with two distinct properties: 1) the first human annotated image-to-poem pair dataset (with 8,292 pairs in total), and 2) to-date the largest public English poem corpus dataset (with 92,265 different poems in total). Extensive experiments are conducted with 8K images, among which 1.5K image are randomly picked for evaluation. Both objective and subjective evaluations show the superior performances against the state-of-the-art methods for poem generation from images. Turing test carried out with over 500 human subjects, among which 30 evaluators are poetry experts, demonstrates the effectiveness of our approach.

DBOct 24, 2016

Quantifying Differential Privacy under Temporal Correlations

Yang Cao, Masatoshi Yoshikawa, Yonghui Xiao et al.

Differential Privacy (DP) has received increased attention as a rigorous privacy framework. Existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives, which assume that the data are independent, or that adversaries do not have knowledge of the data correlations. However, continuously generated data in the real world tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations in the context of continuous data release. First, we model the temporal correlations using Markov model and analyze the privacy leakage of a DP mechanism when adversaries have knowledge of such temporal correlations. Our analysis reveals that the privacy leakage of a DP mechanism may accumulate and increase over time. We call it temporal privacy leakage. Second, to measure such privacy leakage, we design an efficient algorithm for calculating it in polynomial time. Although the temporal privacy leakage may increase over time, we also show that its supremum may exist in some cases. Third, to bound the privacy loss, we propose mechanisms that convert any existing DP mechanism into one against temporal privacy leakage. Experiments with synthetic data confirm that our approach is efficient and effective.

CYApr 20, 2016

Your Neighbors Are My Spies: Location and other Privacy Concerns in Dating Apps

Nguyen Phong Hoang, Yasuhito Asano, Masatoshi Yoshikawa

Trilateration has recently become one of the well-known threat models to the user's location privacy in location-based applications (aka: location-based services or LBS), especially those containing highly sensitive information such as dating applications. The threat model mainly depends on the distance shown from the targeted victim to the adversary to pinpoint the victim's position. As a countermeasure, most of location-based applications have already implemented the "hide distance" function to protect their user's location privacy. The effectiveness of such approaches however is still questionable. Therefore, in this paper, we first investigate how popular location-based dating applications are currently protecting their user's privacy by testing the two most popular GLBT-focused applications: Jack'd and Grindr.

CYApr 20, 2016

Your Neighbors Are My Spies: Location and other Privacy Concerns in GLBT-focused Location-based Dating Applications

Nguyen Phong Hoang, Yasuhito Asano, Masatoshi Yoshikawa

Trilateration is one of the well-known threat models to the user's location privacy in location-based apps, especially those contain highly sensitive information such as dating apps. The threat model mainly bases on the publicly shown distance from a targeted victim to the adversary to pinpoint the victim's location. As a countermeasure, most of location-based apps have already implemented the 'hide distance' function, or added noise to the publicly shown distance in order to protect their user's location privacy. The effectiveness of such approaches however is still questionable.