LGMay 1, 2022
Differentially Private Multivariate Time Series Forecasting of Aggregated Human Mobility With Deep Learning: Input or Gradient Perturbation?Héber H. Arcolezi, Jean-François Couchot, Denis Renaud et al.
This paper investigates the problem of forecasting multivariate aggregated human mobility while preserving the privacy of the individuals concerned. Differential privacy, a state-of-the-art formal notion, has been used as the privacy guarantee in two different and independent steps when training deep learning models. On one hand, we considered \textit{gradient perturbation}, which uses the differentially private stochastic gradient descent algorithm to guarantee the privacy of each time series sample in the learning stage. On the other hand, we considered \textit{input perturbation}, which adds differential privacy guarantees in each sample of the series before applying any learning. We compared four state-of-the-art recurrent neural networks: Long Short-Term Memory, Gated Recurrent Unit, and their Bidirectional architectures, i.e., Bidirectional-LSTM and Bidirectional-GRU. Extensive experiments were conducted with a real-world multivariate mobility dataset, which we published openly along with this paper. As shown in the results, differentially private deep learning models trained under gradient or input perturbation achieve nearly the same performance as non-private deep learning models, with loss in performance varying between $0.57\%$ to $2.8\%$. The contribution of this paper is significant for those involved in urban planning and decision-making, providing a solution to the human mobility multivariate forecast problem through differentially private deep learning models.
CRSep 16, 2022
De-Identification of French Unstructured Clinical Notes for Machine Learning TasksYakini Tchouka, Jean-François Couchot, Maxime Coulmeau et al.
Unstructured textual data are at the heart of health systems: liaison letters between doctors, operating reports, coding of procedures according to the ICD-10 standard, etc. The details included in these documents make it possible to get to know the patient better, to better manage him or her, to better study the pathologies, to accurately remunerate the associated medical acts\ldots All this seems to be (at least partially) within reach of today by artificial intelligence techniques. However, for obvious reasons of privacy protection, the designers of these AIs do not have the legal right to access these documents as long as they contain identifying data. De-identifying these documents, i.e. detecting and deleting all identifying information present in them, is a legally necessary step for sharing this data between two complementary worlds. Over the last decade, several proposals have been made to de-identify documents, mainly in English. While the detection scores are often high, the substitution methods are often not very robust to attack. In French, very few methods are based on arbitrary detection and/or substitution rules. In this paper, we propose a new comprehensive de-identification method dedicated to French-language medical documents. Both the approach for the detection of identifying elements (based on deep learning) and their substitution (based on differential privacy) are based on the most proven existing approaches. The result is an approach that effectively protects the privacy of the patients at the heart of these medical documents. The whole approach has been evaluated on a French language medical dataset of a French public hospital and the results are very encouraging.
CLApr 6, 2023
Automatic ICD-10 Code Association: A Challenging Task on French Clinical TextsYakini Tchouka, Jean-François Couchot, David Laiymani et al.
Automatically associating ICD codes with electronic health data is a well-known NLP task in medical research. NLP has evolved significantly in recent years with the emergence of pre-trained language models based on Transformers architecture, mainly in the English language. This paper adapts these models to automatically associate the ICD codes. Several neural network architectures have been experimented with to address the challenges of dealing with a large set of both input tokens and labels to be guessed. In this paper, we propose a model that combines the latest advances in NLP and multi-label classification for ICD-10 code association. Fair experiments on a Clinical dataset in the French language show that our approach increases the $F_1$-score metric by more than 55\% compared to state-of-the-art results.
CRNov 2, 2022
An Easy-to-use and Robust Approach for the Differentially Private De-Identification of Clinical Textual DocumentsYakini Tchouka, Jean-François Couchot, David Laiymani
Unstructured textual data is at the heart of healthcare systems. For obvious privacy reasons, these documents are not accessible to researchers as long as they contain personally identifiable information. One way to share this data while respecting the legislative framework (notably GDPR or HIPAA) is, within the medical structures, to de-identify it, i.e. to detect the personal information of a person through a Named Entity Recognition (NER) system and then replacing it to make it very difficult to associate the document with the person. The challenge is having reliable NER and substitution tools without compromising confidentiality and consistency in the document. Most of the conducted research focuses on English medical documents with coarse substitutions by not benefiting from advances in privacy. This paper shows how an efficient and differentially private de-identification approach can be achieved by strengthening the less robust de-identification method and by adapting state-of-the-art differentially private mechanisms for substitution purposes. The result is an approach for de-identifying clinical documents in French language, but also generalizable to other languages and whose robustness is mathematically proven.
6.3AIApr 21
A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy VulnerabilitiesAya Cherigui, Florent Guépin, Arnaud Legendre et al.
Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging development in generative models, were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we introduced a first step towards solving it, by the introduction and application of a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a great challenge to consider and that it should be tackled through adversarial evaluation in accordance with the current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance over the trajectory user-linking problem.
CRNov 8, 2021
Improving the utility of locally differentially private protocols for longitudinal and multidimensional frequency estimatesHéber H. Arcolezi, Jean-François Couchot, Bechara Al Bouna et al.
This paper investigates the problem of collecting multidimensional data throughout time (i.e., longitudinal studies) for the fundamental task of frequency estimation under Local Differential Privacy (LDP) guarantees. Contrary to frequency estimation of a single attribute, the multidimensional aspect demands particular attention to the privacy budget. Besides, when collecting user statistics longitudinally, privacy progressively degrades. Indeed, the "multiple" settings in combination (i.e., many attributes and several collections throughout time) impose several challenges, for which this paper proposes the first solution for frequency estimates under LDP. To tackle these issues, we extend the analysis of three state-of-the-art LDP protocols (Generalized Randomized Response -- GRR, Optimized Unary Encoding -- OUE, and Symmetric Unary Encoding -- SUE) for both longitudinal and multidimensional data collections. While the known literature uses OUE and SUE for two rounds of sanitization (a.k.a. memoization), i.e., L-OUE and L-SUE, respectively, we analytically and experimentally show that starting with OUE and then with SUE provides higher data utility (i.e., L-OSUE). Also, for attributes with small domain sizes, we propose Longitudinal GRR (L-GRR), which provides higher utility than the other protocols based on unary encoding. Last, we also propose a new solution named Adaptive LDP for LOngitudinal and Multidimensional FREquency Estimates (ALLOMFREE), which randomly samples a single attribute to be sent with the whole privacy budget and adaptively selects the optimal protocol, i.e., either L-GRR or L-OSUE. As shown in the results, ALLOMFREE consistently and considerably outperforms the state-of-the-art L-SUE and L-OUE protocols in the quality of the frequency estimates.
CRSep 15, 2021
Random Sampling Plus Fake Data: Multidimensional Frequency Estimates With Local Differential PrivacyHéber H. Arcolezi, Jean-François Couchot, Bechara Al Bouna et al.
With local differential privacy (LDP), users can privatize their data and thus guarantee privacy properties before transmitting it to the server (a.k.a. the aggregator). One primary objective of LDP is frequency (or histogram) estimation, in which the aggregator estimates the number of users for each possible value. In practice, when a study with rich content on a population is desired, the interest is in the multiple attributes of the population, that is to say, in multidimensional data ($d \geq 2$). However, contrary to the problem of frequency estimation of a single attribute (the majority of the works), the multidimensional aspect imposes to pay particular attention to the privacy budget. This one can indeed grow extremely quickly due to the composition theorem. To the authors' knowledge, two solutions seem to stand out for this task: 1) splitting the privacy budget for each attribute, i.e., send each value with $\fracε{d}$-LDP (Spl), and 2) random sampling a single attribute and spend all the privacy budget to send it with $ε$-LDP (Smp). Although Smp adds additional sampling error, it has proven to provide higher data utility than the former Spl solution. However, we argue that aggregators (who are also seen as attackers) are aware of the sampled attribute and its LDP value, which is protected by a "less strict" $e^ε$ probability bound (rather than $e^{ε/d}$). This way, we propose a solution named Random Sampling plus Fake Data (RS+FD), which allows creating uncertainty over the sampled attribute by generating fake data for each non-sampled attribute; RS+FD further benefits from amplification by sampling. We theoretically and experimentally validate our proposed solution on both synthetic and real-world datasets to show that RS+FD achieves nearly the same or better utility than the state-of-the-art Smp solution.
MMJun 27, 2017
A Robust Data Hiding Process Contributing to the Development of a Semantic WebJacques M. Bahi, Jean-François Couchot, Nicolas Friot et al.
In this paper, a novel steganographic scheme based on chaotic iterations is proposed. This research work takes place into the information hiding framework, and focus more specifically on robust steganography. Steganographic algorithms can participate in the development of a semantic web: medias being on the Internet can be enriched by information related to their contents, authors, etc., leading to better results for the search engines that can deal with such tags. As media can be modified by users for various reasons, it is preferable that these embedding tags can resist to changes resulting from some classical transformations as for example cropping, rotation, image conversion, and so on. This is why a new robust watermarking scheme for semantic search engines is proposed in this document. For the sake of completeness, the robustness of this scheme is finally compared to existing established algorithms.
PEJun 25, 2017
Well-supported phylogenies using largest subsets of core-genes by discrete particle swarm optimizationReem Alsrraj, Bassam AlKindy, Christophe Guyeux et al.
The number of complete chloroplastic genomes increases day after day, making it possible to rethink plants phylogeny at the biomolecular era. Given a set of close plants sharing in the order of one hundred of core chloroplastic genes, this article focuses on how to extract the largest subset of sequences in order to obtain the most supported species tree. Due to computational complexity, a discrete and distributed Particle Swarm Optimization (DPSO) is proposed. It is finally applied to the core genes of Rosales order.
CRJun 25, 2017
One random jump and one permutation: sufficient conditions to chaotic, statistically faultless, and large throughput PRNG for FPGAMohammed Bakiri, Jean-François Couchot, Christophe Guyeux
Sub-categories of mathematical topology, like the mathematical theory of chaos, offer interesting applications devoted to information security. In this research work, we have introduced a new chaos-based pseudorandom number generator implemented in FPGA, which is mainly based on the deletion of a Hamilton cycle within the $n$-cube (or on the vectorial negation), plus one single permutation. By doing so, we produce a kind of post-treatment on hardware pseudorandom generators, but the obtained generator has usually a better statistical profile than its input, while running at a similar speed. We tested 6 combinations of Boolean functions and strategies that all achieve to pass the most stringent TestU01 battery of tests. This generation can reach a throughput/latency ratio equal to 6.7 Gbps, being thus the second fastest FPGA generator that can pass TestU01.
CRFeb 8, 2017
Random Walk in a N-cube Without Hamiltonian Cycle to Chaotic Pseudorandom Number Generation: Theoretical and Practical ConsiderationsSylvain Contassot-Vivier, Jean-François Couchot, Christophe Guyeux et al.
Designing a pseudorandom number generator (PRNG) is a difficult and complex task. Many recent works have considered chaotic functions as the basis of built PRNGs: the quality of the output would indeed be an obvious consequence of some chaos properties. However, there is no direct reasoning that goes from chaotic functions to uniform distribution of the output. Moreover, embedding such kind of functions into a PRNG does not necessarily allow to get a chaotic output, which could be required for simulating some chaotic behaviors. In a previous work, some of the authors have proposed the idea of walking into a $\mathsf{N}$-cube where a balanced Hamiltonian cycle has been removed as the basis of a chaotic PRNG. In this article, all the difficult issues observed in the previous work have been tackled. The chaotic behavior of the whole PRNG is proven. The construction of the balanced Hamiltonian cycle is theoretically and practically solved. An upper bound of the expected length of the walk to obtain a uniform distribution is calculated. Finally practical experiments show that the generators successfully pass the classical statistical tests.
CRNov 25, 2016
FPGA Implementation of $\mathbb{F}_2$-Linear Pseudorandom Number Generators Based on Zynq MPSoC: a Chaotic Iterations Post Processing Case StudyMohammed Bakiri, Jean-François Couchot, Christophe Guyeux
Pseudorandom number generation (PRNG) is a key element in hardware security platforms like field-programmable gate array FPGA circuits. In this article, 18 PRNGs belonging in 4 families (xorshift, LFSR, TGFSR, and LCG) are physically implemented in a FPGA and compared in terms of area, throughput, and statistical tests. Two flows of conception are used for Register Transfer Level (RTL) and High-level Synthesis (HLS). Additionally, the relations between linear complexity, seeds, and arithmetic operations on the one hand, and the resources deployed in FPGA on the other hand, are deeply investigated. In order to do that, a SoC based on Zynq EPP with ARM Cortex-$A9$ MPSoC is developed to accelerate the implementation and the tests of various PRNGs on FPGA hardware. A case study is finally proposed using chaotic iterations as a post processing for FPGA. The latter has improved the statistical profile of a combination of PRNGs that, without it, failed in the so-called TestU01 statistical battery of tests.
MMNov 25, 2016
A Second Order Derivatives based Approach for SteganographyJean-François Couchot, Raphaël Couturier, Yousra Ahmed Fadil et al.
Steganography schemes are designed with the objective of minimizing a defined distortion function. In most existing state of the art approaches, this distortion function is based on image feature preservation. Since smooth regions or clean edges define image core, even a small modification in these areas largely modifies image features and is thus easily detectable. On the contrary, textures, noisy or chaotic regions are so difficult to model that the features having been modified inside these areas are similar to the initial ones. These regions are characterized by disturbed level curves. This work presents a new distortion function for steganography that is based on second order derivatives, which are mathematical tools that usually evaluate level curves. Two methods are explained to compute these partial derivatives and have been completely implemented. The first experiments show that these approaches are promising.
AIAug 31, 2016
Binary Particle Swarm Optimization versus Hybrid Genetic Algorithm for Inferring Well Supported Phylogenetic TreesBassam AlKindy, Bashar Al-Nuaimi, Christophe Guyeux et al.
The amount of completely sequenced chloroplast genomes increases rapidly every day, leading to the possibility to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred by their core genes is not necessarily well supported, due to the possible occurrence of problematic genes (i.e., homoplasy, incomplete lineage sorting, horizontal gene transfers, etc.) which may blur the phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained provided such a number of blurring genes is reduced. The problem is thus to determine the largest subset of core genes that produces the best-supported tree. To discard problematic genes and due to the overwhelming number of possible combinations, this article focuses on how to extract the largest subset of sequences in order to obtain the most supported species tree. Due to computational complexity, a distributed Binary Particle Swarm Optimization (BPSO) is proposed in sequential and distributed fashions. Obtained results from both versions of the BPSO are compared with those computed using an hybrid approach embedding both genetic algorithms and statistical tests. The proposal has been applied to different cases of plant families, leading to encouraging results for these families.
CRAug 21, 2016
Quality Analysis of a Chaotic Proven Keyed Hash FunctionJacques M. Bahi, Jean-François Couchot, Christophe Guyeux
Hash functions are cryptographic tools, which are notably involved in integrity checking and password storage. They are of primary importance to improve the security of exchanges through the Internet. However, as security flaws have been recently identified in the current standard in this domain, new ways to hash digital data must be investigated. In this document an original keyed hash function is evaluated. It is based on asynchronous iterations leading to functions that have been proven to be chaotic. It thus possesses various topological properties as uniformity and sensibility to its initial condition. These properties make our hash function satisfies established security requirements in this field. This claim is qualitatively proven and experimentally verified in this research work, among other things by realizing a large number of simulations.
NEAug 21, 2016
Neural Networks and Chaos: Construction, Evaluation of Chaotic Networks, and Prediction of Chaos with Multilayer Feedforward NetworksJacques M. Bahi, Jean-François Couchot, Christophe Guyeux et al.
Many research works deal with chaotic neural networks for various fields of application. Unfortunately, up to now these networks are usually claimed to be chaotic without any mathematical proof. The purpose of this paper is to establish, based on a rigorous theoretical framework, an equivalence between chaotic iterations according to Devaney and a particular class of neural networks. On the one hand we show how to build such a network, on the other hand we provide a method to check if a neural network is a chaotic one. Finally, the ability of classical feedforward multilayer perceptrons to learn sets of data obtained from a dynamical system is regarded. Various Boolean functions are iterated on finite states. Iterations of some of them are proven to be chaotic as it is defined by Devaney. In that context, important differences occur in the training process, establishing with various neural networks that chaotic behaviors are far more difficult to learn.
MMAug 20, 2016
Steganalyzer performances in operational contextsYousra A. Fadil, Jean-François Couchot, Raphaël Couturier et al.
Steganography and steganalysis are two important branches of the information hiding field of research. Steganography methods consist in hiding information in such a way that the secret message is undetectable for the uninitiated. Steganalyzis encompasses all the techniques that attempt to detect the presence of such hidden information. This latter is usually designed by making classifiers able to separate innocent images from steganographied ones according to their differences on well-selected features. We wonder, in this article whether it is possible to construct a kind of universal steganalyzer without any knowledge regarding the steganographier side. The effects on the classification score of a modification of either parameters or methods between the learning and testing stages are then evaluated, while the possibility to improve the separation score by merging many methods during learning stage is deeper investigated.
MMMay 25, 2016
Steganalysis via a Convolutional Neural Network using Large Convolution Filters for Embedding Process with Same Stego KeyJean-François Couchot, Raphaël Couturier, Christophe Guyeux et al.
For the past few years, in the race between image steganography and steganalysis, deep learning has emerged as a very promising alternative to steganalyzer approaches based on rich image models combined with ensemble classifiers. A key knowledge of image steganalyzer, which combines relevant image features and innovative classification procedures, can be deduced by a deep learning approach called Convolutional Neural Networks (CNN). These kind of deep learning networks is so well-suited for classification tasks based on the detection of variations in 2D shapes that it is the state-of-the-art in many image recognition problems. In this article, we design a CNN-based steganalyzer for images obtained by applying steganography with a unique embedding key. This one is quite different from the previous study of {\em Qian et al.} and its successor, namely {\em Pibre et al.} The proposed architecture embeds less convolutions, with much larger filters in the final convolutional layer, and is more general: it is able to deal with larger images and lower payloads. For the "same embedding key" scenario, our proposal outperforms all other steganalyzers, in particular the existing CNN-based ones, and defeats many state-of-the-art image steganography schemes.
AIApr 20, 2015
Hybrid Genetic Algorithm and Lasso Test Approach for Inferring Well Supported Phylogenetic Trees based on Subsets of Chloroplastic Core GenesBassam AlKindy, Christophe Guyeux, Jean-François Couchot et al.
The amount of completely sequenced chloroplast genomes increases rapidly every day, leading to the possibility to build large scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred by their core genes is not necessarily well supported, due to the possible occurrence of "problematic" genes (i.e., homoplasy, incomplete lineage sorting, horizontal gene transfers, etc.) which may blur phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained if the number of problematic genes is low, the problem being to determine the largest subset of core genes that produces the best supported tree. To discard problematic genes and due to the overwhelming number of possible combinations, we propose an hybrid approach that embeds both genetic algorithms and statistical tests. Given a set of organisms, the result is a pipeline of many stages for the production of well supported phylogenetic trees. The proposal has been applied to different cases of plant families, leading to encouraging results for these families.
NEDec 17, 2014
Gene Similarity-based Approaches for Determining Core-Genes of ChloroplastsBassam AlKindy, Christophe Guyeux, Jean-François Couchot et al.
In computational biology and bioinformatics, the manner to understand evolution processes within various related organisms paid a lot of attention these last decades. However, accurate methodologies are still needed to discover genes content evolution. In a previous work, two novel approaches based on sequence similarities and genes features have been proposed. More precisely, we proposed to use genes names, sequence similarities, or both, insured either from NCBI or from DOGMA annotation tools. Dogma has the advantage to be an up-to-date accurate automatic tool specifically designed for chloroplasts, whereas NCBI possesses high quality human curated genes (together with wrongly annotated ones). The key idea of the former proposal was to take the best from these two tools. However, the first proposal was limited by name variations and spelling errors on the NCBI side, leading to core trees of low quality. In this paper, these flaws are fixed by improving the comparison of NCBI and DOGMA results, and by relaxing constraints on gene names while adding a stage of post-validation on gene sequences. The two stages of similarity measures, on names and sequences, are thus proposed for sequence clustering. This improves results that can be obtained using either NCBI or DOGMA alone. Results obtained with this quality control test are further investigated and compared with previously released ones, on both computational and biological aspects, considering a set of 99 chloroplastic genomes.
CRFeb 23, 2012
Application of Steganography for Anonymity through the InternetJacques M. Bahi, Jean-François Couchot, Nicolas Friot et al.
In this paper, a novel steganographic scheme based on chaotic iterations is proposed. This research work takes place into the information hiding security framework. The applications for anonymity and privacy through the Internet are regarded too. To guarantee such an anonymity, it should be possible to set up a secret communication channel into a web page, being both secure and robust. To achieve this goal, we propose an information hiding scheme being stego-secure, which is the highest level of security in a well defined and studied category of attacks called "watermark-only attack". This category of attacks is the best context to study steganography-based anonymity through the Internet. The steganalysis of our steganographic process is also studied in order to show it security in a real test framework.