David Sánchez

h-index15

25papers

1,385citations

Novelty36%

AI Score43

Ranked #77,782 of 205,806 authors (top 38%)#1,850 in CR (top 25%)

25 Papers

CVNov 20, 2023Code

Multi-Task Faces (MTF) Data Set: A Legally and Ethically Compliant Collection of Face Images for Various Classification Tasks

Rami Haffar, David Sánchez, Josep Domingo-Ferrer

Human facial data offers valuable potential for tackling classification problems, including face recognition, age estimation, gender identification, emotion analysis, and race classification. However, recent privacy regulations, particularly the EU General Data Protection Regulation, have restricted the collection and usage of human images in research. As a result, several previously published face data sets have been removed from the internet due to inadequate data collection methods and privacy concerns. While synthetic data sets have been suggested as an alternative, they fall short of accurately representing the real data distribution. Additionally, most existing data sets are labeled for just a single task, which limits their versatility. To address these limitations, we introduce the Multi-Task Face (MTF) data set, designed for various tasks, including face recognition and classification by race, gender, and age, as well as for aiding in training generative networks. The MTF data set comes in two versions: a non-curated set containing 132,816 images of 640 individuals and a manually curated set with 5,246 images of 240 individuals, meticulously selected to maximize their classification quality. Both data sets were ethically sourced, using publicly available celebrity images in full compliance with copyright regulations. Along with providing detailed descriptions of data collection and processing, we evaluated the effectiveness of the MTF data set in training five deep learning models across the aforementioned classification tasks, achieving up to 98.88\% accuracy for gender classification, 95.77\% for race classification, 97.60\% for age classification, and 79.87\% for face recognition with the ConvNeXT model. Both MTF data sets can be accessed through the following link. https://github.com/RamiHaf/MTF_data_set

CRJul 5, 2022

Defending against the Label-flipping Attack in Federated Learning

Najeeb Moharram Jebreel, Josep Domingo-Ferrer, David Sánchez et al.

Federated learning (FL) provides autonomy and privacy by design to participating peers, who cooperatively build a machine learning (ML) model while keeping their private data in their devices. However, that same autonomy opens the door for malicious peers to poison the model by conducting either untargeted or targeted poisoning attacks. The label-flipping (LF) attack is a targeted poisoning attack where the attackers poison their training data by flipping the labels of some examples from one class (i.e., the source class) to another (i.e., the target class). Unfortunately, this attack is easy to perform and hard to detect and it negatively impacts on the performance of the global model. Existing defenses against LF are limited by assumptions on the distribution of the peers' data and/or do not perform well with high-dimensional models. In this paper, we deeply investigate the LF attack behavior and find that the contradicting objectives of attackers and honest peers on the source class examples are reflected in the parameter gradients corresponding to the neurons of the source and target classes in the output layer, making those gradients good discriminative features for the attack detection. Accordingly, we propose a novel defense that first dynamically extracts those gradients from the peers' local updates, and then clusters the extracted gradients, analyzes the resulting clusters and filters out potential bad updates before model aggregation. Extensive empirical analysis on three data sets shows the proposed defense's effectiveness against the LF attack regardless of the data distribution or model dimensionality. Also, the proposed defense outperforms several state-of-the-art defenses by offering lower test error, higher overall accuracy, higher source class accuracy, lower attack success rate, and higher stability of the source class accuracy.

CRNov 6, 2023

An Examination of the Alleged Privacy Threats of Confidence-Ranked Reconstruction of Census Microdata

David Sánchez, Najeeb Jebreel, Krishnamurty Muralidhar et al.

The threat of reconstruction attacks has led the U.S. Census Bureau (USCB) to replace in the Decennial Census 2020 the traditional statistical disclosure limitation based on rank swapping with one based on differential privacy (DP), leading to substantial accuracy loss of released statistics. Yet, it has been argued that, if many different reconstructions are compatible with the released statistics, most of them do not correspond to actual original data, which protects against respondent reidentification. Recently, a new attack has been proposed, which incorporates the confidence that a reconstructed record was in the original data. The alleged risk of disclosure entailed by such confidence-ranked reconstruction has renewed the interest of the USCB to use DP-based solutions. To forestall a potential accuracy loss in future releases, we show that the proposed reconstruction is neither effective as a reconstruction method nor conducive to disclosure as claimed by its authors. Specifically, we report empirical results showing the proposed ranking cannot guide reidentification or attribute disclosure attacks, and hence fails to warrant the utility sacrifice entailed by the use of DP to release census statistical data.

SOC-PHJul 19, 2023

When Dialects Collide: How Socioeconomic Mixing Affects Language Use

Thomas Louf, José J. Ramasco, David Sánchez et al.

The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English on a large scale, in seven thousand administrative areas of England and Wales. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, across eight metropolitan areas we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.

CRMar 8Code

Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions

Najeeb Jebreel, Mona Khalil, David Sánchez et al.

Membership inference attacks (MIAs) have become the standard tool for evaluating privacy leakage in machine learning (ML). Among them, the Likelihood-Ratio Attack (LiRA) is widely regarded as the state of the art when sufficient shadow models are available. However, prior evaluations have often overstated the effectiveness of LiRA by attacking models overconfident on their training samples, calibrating thresholds on target data, assuming balanced membership priors, and/or overlooking attack reproducibility. We re-evaluate LiRA under a realistic protocol that (i) trains models using anti-overfitting (AOF) and transfer learning (TL), when applicable, to reduce overconfidence as in production models; (ii) calibrates decision thresholds using shadow models and data rather than target data; (iii) measures positive predictive value (PPV, or precision) under shadow-based thresholds and skewed membership priors (pi <= 10%); and (iv) quantifies per-sample membership reproducibility across different seeds and training variations. We find that AOF significantly weakens LiRA, while TL further reduces attack effectiveness while improving model accuracy. Under shadow-based thresholds and skewed priors, LiRA's PPV often drops substantially, especially under AOF or AOF+TL. We also find that thresholded vulnerable sets at extremely low FPR show poor reproducibility across runs, while likelihood-ratio rankings are more stable. These results suggest that LiRA, and likely weaker MIAs, are less effective than previously suggested under realistic conditions, and that reliable privacy auditing requires evaluation protocols that reflect practical training practices, feasible attacker assumptions, and reproducibility considerations. Code is available at https://github.com/najeebjebreel/lira_analysis.

CRJul 7, 2025Code

Efficient Unlearning with Privacy Guarantees

Josep Domingo-Ferrer, Najeeb Jebreel, David Sánchez

Privacy protection laws, such as the GDPR, grant individuals the right to request the forgetting of their personal data not only from databases but also from machine learning (ML) models trained on them. Machine unlearning has emerged as a practical means to facilitate model forgetting of data instances seen during training. Although some existing machine unlearning methods guarantee exact forgetting, they are typically costly in computational terms. On the other hand, more affordable methods do not offer forgetting guarantees and are applicable only to specific ML models. In this paper, we present \emph{efficient unlearning with privacy guarantees} (EUPG), a novel machine unlearning framework that offers formal privacy guarantees to individuals whose data are being unlearned. EUPG involves pre-training ML models on data protected using privacy models, and it enables {\em efficient unlearning with the privacy guarantees offered by the privacy models in use}. Through empirical evaluation on four heterogeneous data sets protected with $k$-anonymity and $ε$-differential privacy as privacy models, our approach demonstrates utility and forgetting effectiveness comparable to those of exact unlearning methods, while significantly reducing computational and storage costs. Our code is available at https://github.com/najeebjebreel/EUPG.

CLJan 25, 2022Code

The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

Ildikó Pilán, Pierre Lison, Lilja Øvrelid et al.

We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored towards measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymisation-benchmark

CRApr 2, 2024

Digital Forgetting in Large Language Models: A Survey of Unlearning Methods

Alberto Blanco-Justicia, Najeeb Jebreel, Benet Manzanares et al.

The objective of digital forgetting is, given a model with undesirable knowledge or behavior, obtain a new model where the detected issues are no longer present. The motivations for forgetting include privacy protection, copyright protection, elimination of biases and discrimination, and prevention of harmful content generation. Effective digital forgetting has to be effective (meaning how well the new model has forgotten the undesired knowledge/behavior), retain the performance of the original model on the desirable tasks, and be scalable (in particular forgetting has to be more efficient than retraining from scratch on just the tasks/data to be retained). This survey focuses on forgetting in large language models (LLMs). We first provide background on LLMs, including their components, the types of LLMs, and their usual training pipeline. Second, we describe the motivations, types, and desired properties of digital forgetting. Third, we introduce the approaches to digital forgetting in LLMs, among which unlearning methodologies stand out as the state of the art. Fourth, we provide a detailed taxonomy of machine unlearning methods for LLMs, and we survey and compare current approaches. Fifth, we detail datasets, models and metrics used for the evaluation of forgetting, retaining and runtime. Sixth, we discuss challenges in the area. Finally, we provide some concluding remarks.

CLMar 27, 2024

Exploring language relations through syntactic distances and geographic proximity

Juan De Gregorio, Raúl Toral, David Sánchez

Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.

CLDec 17, 2024

Truthful Text Sanitization Guided by Inference Attacks

Ildikó Pilán, Benet Manzanares-Salor, David Sánchez et al.

Text sanitization aims to rewrite parts of a document to prevent disclosure of personal information. The central challenge of text sanitization is to strike a balance between privacy protection (avoiding the leakage of personal information) and utility preservation (retaining as much as possible of the document's original content). To this end, we introduce a novel text sanitization method based on generalizations, that is, broader but still informative terms that subsume the semantic content of the original text spans. The approach relies on the use of instruction-tuned large language models (LLMs) and is divided into two stages. Given a document including text spans expressing personally identifiable information (PII), the LLM is first applied to obtain truth-preserving replacement candidates for each text span and rank those according to their abstraction level. Those candidates are then evaluated for their ability to protect privacy by conducting inference attacks with the LLM. Finally, the system selects the most informative replacement candidate shown to be resistant to those attacks. This two-stage process produces replacements that effectively balance privacy and utility. We also present novel metrics to evaluate these two aspects without needing to manually annotate documents. Results on the Text Anonymization Benchmark show that the proposed approach, implemented with Mistral 7B Instruct, leads to enhanced utility, with only a marginal (< 1 p.p.) increase in re-identification risk compared to fully suppressing the original spans. Furthermore, our approach is shown to be more truth-preserving than existing methods such as Microsoft Presidio's synthetic replacements.

CRAug 4, 2021

Secure and Privacy-Preserving Federated Learning via Co-Utility

Josep Domingo-Ferrer, Alberto Blanco-Justicia, Jesús Manjón et al.

The decentralized nature of federated learning, that often leverages the power of edge devices, makes it vulnerable to attacks against privacy and security. The privacy risk for a peer is that the model update she computes on her private data may, when sent to the model manager, leak information on those private data. Even more obvious are security attacks, whereby one or several malicious peers return wrong model updates in order to disrupt the learning process and lead to a wrong model being learned. In this paper we build a federated learning framework that offers privacy to the participating peers as well as security against Byzantine and poisoning attacks. Our framework consists of several protocols that provide strong privacy to the participating peers via unlinkable anonymity and that are rationally sustainable based on the co-utility property. In other words, no rational party is interested in deviating from the proposed protocols. We leverage the notion of co-utility to build a decentralized co-utile reputation management system that provides incentives for parties to adhere to the protocols. Unlike privacy protection via differential privacy, our approach preserves the values of model updates and hence the accuracy of plain federated learning; unlike privacy protection via update aggregation, our approach preserves the ability to detect bad model updates while substantially reducing the computational overhead compared to methods based on homomorphic encryption.

CRDec 12, 2020

Achieving Security and Privacy in Federated Learning Systems: Survey, Research Challenges and Future Directions

Alberto Blanco-Justicia, Josep Domingo-Ferrer, Sergio Martínez et al.

Federated learning (FL) allows a server to learn a machine learning (ML) model across multiple decentralized clients that privately store their own training data. In contrast with centralized ML approaches, FL saves computation to the server and does not require the clients to outsource their private data to the server. However, FL is not free of issues. On the one hand, the model updates sent by the clients at each training epoch might leak information on the clients' private data. On the other hand, the model learnt by the server may be subjected to attacks by malicious clients; these security attacks might poison the model or prevent it from converging. In this paper, we first examine security and privacy attacks to FL and critically survey solutions proposed in the literature to mitigate each attack. Afterwards, we discuss the difficulty of simultaneously achieving security and privacy protection. Finally, we sketch ways to tackle this open problem and attain both security and privacy.

CRNov 4, 2020

The Limits of Differential Privacy (and its Misuse in Data Release and Machine Learning)

Josep Domingo-Ferrer, David Sánchez, Alberto Blanco-Justicia

Differential privacy (DP) is a neat privacy definition that can co-exist with certain well-defined data uses in the context of interactive queries. However, DP is neither a silver bullet for all privacy problems nor a replacement for all previous privacy models. In fact, extreme care should be exercised when trying to extend its use beyond the setting it was designed for. This paper reviews the limitations of DP and its misuse for individual data collection, individual data release, and machine learning.

CRAug 3, 2018

How to Avoid Reidentification with Proper Anonymization

David Sánchez, Sergio Martínez, Josep Domingo-Ferrer

De Montjoye et al. claimed that most individuals can be reidentified from a deidentified transaction database and that anonymization mechanisms are not effective against reidentification. We demonstrate that anonymization can be performed by techniques well established in the literature.

CLJul 3, 2017

Mapping the Americanization of English in Space and Time

Bruno Gonçalves, Lucía Loureiro-Porto, José J. Ramasco et al.

As global political preeminence gradually shifted from the United Kingdom to the United States, so did the capacity to culturally influence the rest of the world. In this work, we analyze how the world-wide varieties of written English are evolving. We study both the spatial and temporal variations of vocabulary and spelling of English using a large corpus of geolocated tweets and the Google Books datasets corresponding to books published in the US and the UK. The advantage of our approach is that we can address both standard written language (Google Books) and the more colloquial forms of microblogging messages (Twitter). We find that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders. Finally, we analyze how this trend has evolved over time and the impact that some cultural events have had in shaping it.

CRJul 3, 2017

Privacy-preserving data outsourcing in the cloud via semantic data splitting

David Sánchez, Montserrat Batet

Even though cloud computing provides many intrinsic benefits, privacy concerns related to the lack of control over the storage and management of the outsourced data still prevent many customers from migrating to the cloud. Several privacy-protection mechanisms based on a prior encryption of the data to be outsourced have been proposed. Data encryption offers robust security, but at the cost of hampering the efficiency of the service and limiting the functionalities that can be applied over the (encrypted) data stored on cloud premises. Because both efficiency and functionality are crucial advantages of cloud computing, in this paper we aim at retaining them by proposing a privacy-protection mechanism that relies on splitting (clear) data, and on the distributed storage offered by the increasingly popular notion of multi-clouds. We propose a semantically-grounded data splitting mechanism that is able to automatically detect pieces of data that may cause privacy risks and split them on local premises, so that each chunk does not incur in those risks; then, chunks of clear data are independently stored into the separate locations of a multi-cloud, so that external entities cannot have access to the whole confidential data. Because partial data are stored in clear on cloud premises, outsourced functionalities are seamlessly and efficiently supported by just broadcasting queries to the different cloud locations. To enforce a robust privacy notion, our proposal relies on a privacy model that offers a priori privacy guarantees; to ensure its feasibility, we have designed heuristic algorithms that minimize the number of cloud storage locations we need; to show its potential and generality, we have applied it to the least structured and most challenging data type: plain textual documents.

CRJan 2, 2017

Toward sensitive document release with privacy guarantees

David Sánchez, Montserrat Batet

Privacy has become a serious concern for modern Information Societies. The sensitive nature of much of the data that are daily exchanged or released to untrusted parties requires that responsible organizations undertake appropriate privacy protection measures. Nowadays, much of these data are texts (e.g., emails, messages posted in social media, healthcare outcomes, etc.) that, because of their unstructured and semantic nature, constitute a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts from this burdensome task, in a previous work we introduced the theoretical basis of C-sanitization, an inherently semantic privacy model that provides the basis to the development of automatic document redaction/sanitization algorithms and offers clear and a priori privacy guarantees on data protection; even though its potential benefits C-sanitization still presents some limitations when applied to practice (mainly regarding flexibility, efficiency and accuracy). In this paper, we propose a new more flexible model, named (C, g(C))-sanitization, which enables an intuitive configuration of the trade-off between the desired level of protection (i.e., controlled information disclosure) and the preservation of the utility of the protected data (i.e., amount of semantics to be preserved). Moreover, we also present a set of technical solutions and algorithms that provide an efficient and scalable implementation of the model and improve its practical accuracy, as we also illustrate through empirical experiments.

CRDec 7, 2016

Individual Differential Privacy: A Utility-Preserving Formulation of Differential Privacy Guarantees

Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez et al.

Differential privacy is a popular privacy model within the research community because of the strong privacy guarantee it offers, namely that the presence or absence of any individual in a data set does not significantly influence the results of analyses on the data set. However, enforcing this strict guarantee in practice significantly distorts data and/or limits data uses, thus diminishing the analytical utility of the differentially private results. In an attempt to address this shortcoming, several relaxations of differential privacy have been proposed that trade off privacy guarantees for improved data utility. In this work, we argue that the standard formalization of differential privacy is stricter than required by the intuitive privacy guarantee it seeks. In particular, the standard formalization requires indistinguishability of results between any pair of neighbor data sets, while indistinguishability between the actual data set and its neighbor data sets should be enough. This limits the data controller's ability to adjust the level of protection to the actual data, hence resulting in significant accuracy loss. In this respect, we propose individual differential privacy, an alternative differential privacy notion that offers em the same privacy guarantees as standard differential privacy to individuals (even though not to groups of individuals). This new notion allows the data controller to adjust the distortion to the actual data set, which results in less distortion and more analytical accuracy. We propose several mechanisms to attain individual differential privacy and we compare the new notion against standard differential privacy in terms of the accuracy of the analytical results.

SIJul 4, 2016

Privacy-driven Access Control in Social Networks by Means of Automatic Semantic Annotation

Malik Imran-Daud, David Sánchez, Alexandre Viejo

In online social networks (OSN), users quite usually disclose sensitive information about themselves by publishing messages. At the same time, they are (in many cases) unable to properly manage the access to this sensitive information due to the following issues: i) the rigidness of the access control mechanism implemented by the OSN, and ii) many users lack of technical knowledge about data privacy and access control. To tackle these limitations, in this paper, we propose a dynamic, transparent and privacy-driven access control mechanism for textual messages published in OSNs. The notion of privacy-driven is achieved by analyzing the semantics of the messages to be published and, according to that, assessing the degree of sensitiveness of their contents. For this purpose, the proposed system relies on an automatic semantic annotation mechanism that, by using knowledge bases and linguistic tools, is able to associate a meaning to the information to be published. By means of this annotation, our mechanism automatically detects the information that is sensitive according to the privacy requirements of the publisher of data, with regard to the type of reader that may access such data. Finally, our access control mechanism automatically creates sanitized versions of the users' publications according to the type of reader that accesses them. As a result, our proposal, which can be integrated in already existing social networks, provides an automatic, seamless and content-driven protection of user publications, which are coherent with her privacy requirements and the type of readers that access them. Complementary to the system design, we also discuss the feasibility of the system by illustrating it through a real example and evaluate its accuracy and effectiveness over standard approaches.

CRDec 9, 2015

t-Closeness through Microaggregation: Strict Privacy with Enhanced Utility Preservation

Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez et al.

Microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. It has been used as an alternative to generalization and suppression to generate $k$-anonymous data sets, where the identity of each subject is hidden within a group of $k$ subjects. Unlike generalization, microaggregation perturbs the data and this additional masking freedom allows improving data utility in several ways, such as increasing data granularity, reducing the impact of outliers and avoiding discretization of numerical data. $k$-Anonymity, on the other side, does not protect against attribute disclosure, which occurs if the variability of the confidential values in a group of $k$ subjects is too small. To address this issue, several refinements of $k$-anonymity have been proposed, among which $t$-closeness stands out as providing one of the strictest privacy guarantees. Existing algorithms to generate $t$-close data sets are based on generalization and suppression (they are extensions of $k$-anonymization algorithms based on the same principles). This paper proposes and shows how to use microaggregation to generate $k$-anonymous $t$-close data sets. The advantages of microaggregation are analyzed, and then several microaggregation algorithms for $k$-anonymous $t$-closeness are presented and empirically evaluated.

CRDec 9, 2015

Utility-Preserving Differentially Private Data Releases Via Individual Ranking Microaggregation

David Sánchez, Josep Domingo-Ferrer, Sergio Martínez et al.

Being able to release and exploit open data gathered in information systems is crucial for researchers, enterprises and the overall society. Yet, these data must be anonymized before release to protect the privacy of the subjects to whom the records relate. Differential privacy is a privacy model for anonymization that offers more robust privacy guarantees than previous models, such as $k$-anonymity and its extensions. However, it is often disregarded that the utility of differentially private outputs is quite limited, either because of the amount of noise that needs to be added to obtain them or because utility is only preserved for a restricted type and/or a limited number of queries. On the contrary, $k$-anonymity-like data releases make no assumptions on the uses of the protected data and, thus, do not restrict the number and type of doable analyses. Recently, some authors have proposed mechanisms to offer general-purpose differentially private data releases. This paper extends such works with a specific focus on the preservation of the utility of the protected data. Our proposal builds on microaggregation-based anonymization, which is more flexible and utility-preserving than alternative anonymization methods used in the literature, in order to reduce the amount of noise needed to satisfy differential privacy. In this way, we improve the utility of differentially private data releases. Moreover, the noise reduction we achieve does not depend on the size of the data set, but just on the number of attributes to be protected, which is a more desirable behavior for large data sets. The utility benefits brought by our proposal are empirically evaluated and compared with related works for several data sets and metrics.

CRDec 9, 2015

Enforcing transparent access to private content in social networks by means of automatic sanitization

Alexandre Viejo, David Sánchez

Social networks have become an essential meeting point for millions of individuals willing to publish and consume huge quantities of heterogeneous information. Some studies have shown that the data published in these platforms may contain sensitive personal information and that external entities can gather and exploit this knowledge for their own benefit. Even though some methods to preserve the privacy of social networks users have been proposed, they generally apply rigid access control measures to the protected content and, even worse, they do not enable the users to understand which contents are sensitive. Last but not least, most of them require the collaboration of social network operators or they fail to provide a practical solution capable of working with well-known and already deployed social platforms. In this paper, we propose a new scheme that addresses all these issues. The new system is envisaged as an independent piece of software that does not depend on the social network in use and that can be transparently applied to most existing ones. According to a set of privacy requirements intuitively defined by the users of a social network, the proposed scheme is able to: (i) automatically detect sensitive data in users' publications; (ii) construct sanitized versions of such data; and (iii) provide privacy-preserving transparent access to sensitive contents by disclosing more or less information to readers according to their credentials toward the owner of the publications. We also study the applicability of the proposed system in general and illustrate its behavior in two case studies.

CRNov 18, 2015

Supplementary Materials for "How to Avoid Reidentification with Proper Anonymization"- Comment on "Unique in the shopping mall: on the reidentifiability of credit card metadata"

David Sánchez, Sergio Martínez, Josep Domingo-Ferrer

The study by De Montjoye et al. ("Science", 30 January 2015, p. 536) claimed that most individuals can be reidentified from a deidentified credit card transaction database and that anonymization mechanisms are not effective against reidentification. Such claims deserve detailed quantitative scrutiny, as they might seriously undermine the willingness of data owners and subjects to share data for research. In a recent Technical Comment published in "Science" (18 March 2016, p. 1274), we demonstrate that the reidentification risk reported by De Montjoye et al. was significantly overestimated (due to a misunderstanding of the reidentification attack) and that the alleged ineffectiveness of anonymization is due to the choice of poor and undocumented methods and to a general disregard of 40 years of anonymization literature. The technical comment also shows how to properly anonymize data, in order to reduce unequivocal reidentifications to zero while retaining even more analytical utility than with the poor anonymization mechanisms employed by De Montjoye et al. In conclusion, data owners, subjects and users can be reassured that sound privacy models and anonymization methods exist to produce safe and useful anonymized data. Supplementary materials detailing the data sets, algorithms and extended results of our study are available here. Moreover, unlike the De Montjoye et al.'s data set, which was never made available, our data, anonymized results, and anonymization algorithms can be freely downloaded from http://crises-deim.urv.cat/opendata/SPD_Science.zip

MLNov 16, 2015

Learning about Spanish dialects through Twitter

Bruno Gonçalves, David Sánchez

This paper maps the large-scale variation of the Spanish language by employing a corpus based on geographically tagged Twitter messages. Lexical dialects are extracted from an analysis of variants of tens of concepts. The resulting maps show linguistic variation on an unprecedented scale across the globe. We discuss the properties of the main dialects within a machine learning approach and find that varieties spoken in urban areas have an international character in contrast to country areas where dialects show a more regional uniformity.

SOC-PHJul 26, 2014

Crowdsourcing Dialect Characterization through Twitter

Bruno Gonçalves, David Sánchez

We perform a large-scale analysis of language diatopic variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well defined macroregions sharing common lexical properties. Remarkably enough, we find that Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish citites and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.