Carlos Sarraute

CR
21 papers
580 citations
Novelty 38%
AI Score 23

21 Papers

SI | Feb 23, 2020
The Value of Big Data for Credit Scoring: Enhancing Financial Inclusion using Mobile Phone Data and Social Network Analytics

María Óskarsdóttir, Cristián Bravo, Carlos Sarraute et al.

Credit scoring is without a doubt one of the oldest applications of analytics. In recent years, a multitude of sophisticated classification techniques have been developed to improve the statistical performance of credit scoring models. Instead of focusing on the techniques themselves, this paper leverages alternative data sources to enhance both statistical and economic model performance. The study demonstrates how including call networks, in the context of positive credit information, as a new Big Data source has added value in terms of profit by applying a profit measure and profit-based feature selection. A unique combination of datasets, including call-detail records, credit and debit account information of customers is used to create scorecards for credit card applicants. Call-detail records are used to build call networks and advanced social network analytics techniques are applied to propagate influence from prior defaulters throughout the network to produce influence scores. The results show that combining call-detail records with traditional data in credit scoring models significantly increases their performance when measured in AUC. In terms of profit, the best model is the one built with only calling behavior features. In addition, the calling behavior features are the most predictive in other models, both in terms of statistical and economic performance. The results have an impact in terms of ethical use of call-detail records, regulatory implications, financial inclusion, as well as data sharing and privacy.

CR | Feb 22, 2020
Fair and Decentralized Exchange of Digital Goods

Ariel Futoransky, Carlos Sarraute, Daniel Fernandez et al.

We construct a privacy-preserving, distributed and decentralized marketplace where parties can exchange data for tokens. In this market, buyers and sellers make transactions on a blockchain and interact with a third party, called a notary, who can vouch for the authenticity and integrity of the data. We introduce a protocol for the data-token exchange in which neither party gains more information than it is paying for, and the exchange is fair: either each party gets the other's item or neither does. No third-party involvement is required after setup, and no dispute resolution is needed.

CR | Feb 10, 2020
WibsonTree: Efficiently Preserving Seller's Privacy in a Decentralized Data Marketplace

Ariel Futoransky, Carlos Sarraute, Ariel Waissbein et al.

We present a cryptographic primitive called WibsonTree designed to preserve users' privacy by allowing them to demonstrate predicates on their personal attributes, without revealing the values of those attributes. We suppose that there are three types of agents (buyers, sellers and notaries) who interact in a decentralized privacy-preserving data marketplace (dPDM) such as the Wibson marketplace. We introduce the WibsonTree protocol as an efficient cryptographic primitive that enables the exchange of private information while preserving the seller's privacy. Using our primitive, a data seller can efficiently prove that he/she belongs to the target audience of a buyer's data request, without revealing any additional information.
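The abstract does not describe how WibsonTree is built, so the following is only a generic illustration of the underlying idea: a seller commits to several attributes at once and later opens a single one without exposing the others. It uses a plain salted Merkle tree; the function names (`leaf_hash`, `build_levels`, `prove`, `verify`) and the attribute names are hypothetical and are not taken from the paper.

```python
import hashlib
import os


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(name: str, value: str, salt: bytes) -> bytes:
    # The random salt keeps unopened leaves from being guessed by brute force.
    return h(b"leaf:" + salt + name.encode() + b"=" + value.encode())


def build_levels(leaves):
    """Return all tree levels, from the leaves up to the single root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:  # duplicate the last node on odd-sized levels
            cur = cur + [cur[-1]]
        levels.append([h(b"node:" + cur[i] + cur[i + 1])
                       for i in range(0, len(cur), 2)])
    return levels


def prove(levels, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2))  # (sibling, node is right child?)
        index //= 2
    return path


def verify(root, leaf, path):
    node = leaf
    for sibling, is_right in path:
        node = h(b"node:" + (sibling + node if is_right else node + sibling))
    return node == root


# Hypothetical attributes; only the opened one is shown to the verifier.
attrs = [("age_over_18", "true"), ("country", "AR"), ("gender", "f")]
salts = [os.urandom(16) for _ in attrs]
leaves = [leaf_hash(n, v, s) for (n, v), s in zip(attrs, salts)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = prove(levels, 0)
print(verify(root, leaves[0], proof))
```

To open the first attribute, the seller would disclose its name, value, salt and the sibling path; the remaining attributes stay hidden behind their hashes. Proving richer predicates, as the abstract describes, would need more than this sketch provides.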

CR | Feb 6, 2020
BatPay: a gas efficient protocol for the recurrent micropayment of ERC20 tokens

Hartwig Mayer, Ismael Bejarano, Daniel Fernandez et al.

BatPay is a proxy scaling solution for the transfer of ERC20 tokens. It is suitable for micropayments in one-to-many and few-to-many scenarios, including digital markets and the distribution of rewards and dividends. In BatPay, many similar operations are bundled together into a single transaction in order to optimize gas consumption on the Ethereum blockchain. In addition, some costly verifications are replaced by a challenge game, pushing most of the computing cost off-chain. This results in a gas reduction of the transfer costs of three orders of magnitude, achieving around 1700 transactions per second on the Ethereum blockchain. Furthermore, it includes many relevant features, like meta-transactions for end-user operation without ether, and key-locked payments for atomic exchange of digital goods.

CR | Jan 23, 2020
Wibson Protocol for Secure Data Exchange and Batch Payments

Daniel Fernandez, Ariel Futoransky, Gustavo Ajzenman et al.

Wibson is a blockchain-based, decentralized data marketplace that provides individuals a way to securely and anonymously sell information in a trusted environment. The combination of the Wibson token and blockchain-enabled smart contracts hopes to allow Data Sellers and Data Buyers to transact with each other directly while providing individuals the ability to maintain anonymity as desired. The Wibson marketplace will provide infrastructure and financial incentives for individuals to securely sell personal information without sacrificing personal privacy. Data Buyers receive information from willing and actively participating individuals with the benefit of knowing that the personal information should be accurate and current. We present here two different components working together to achieve an efficient decentralized marketplace. The first is a smart contract called Data Exchange, which stores references to Data Orders that different Buyers open in order to show to the market that they are interested in buying certain types of data, and provides secure mechanisms to perform the transactions. The second is used to process payments from Buyers to Sellers and intermediaries, and is called Batch Payments.

CR | Jul 29, 2019
Secure Exchange of Digital Goods in a Decentralized Data Marketplace

Ariel Futoransky, Carlos Sarraute, Ariel Waissbein et al.

We are tackling the problem of trading real-world private information using only cryptographic protocols and a public blockchain to guarantee honest transactions. In this project, we consider three types of agents (buyers, sellers and notaries) interacting in a decentralized privacy-preserving data marketplace (dPDM) such as the Wibson data marketplace. This framework offers infrastructure and financial incentives for individuals to securely sell personal information while preserving personal privacy. Here we provide an efficient cryptographic primitive for the secure exchange of data in a dPDM, which occurs as an atomic operation wherein the data buyer gets access to the data and the data seller gets paid simultaneously.

CY | Dec 24, 2018
Wibson: A Decentralized Data Marketplace

Matias Travizano, Carlos Sarraute, Gustavo Ajzenman et al.

Our aim is for Wibson to be a blockchain-based, decentralized data marketplace that provides individuals a way to securely and anonymously sell information in a trusted environment. The combination of the Wibson token and blockchain-enabled smart contracts hopes to allow Data Sellers and Data Buyers to transact with each other directly while providing individuals the ability to maintain anonymity as desired. Wibson intends that its data marketplace will provide infrastructure and financial incentives for individuals to securely sell personal information without sacrificing personal privacy. Data Buyers receive information from willing and actively participating individuals with the benefit of knowing that the personal information should be accurate and current.

SI | Dec 3, 2018
Brief survey of Mobility Analyses based on Mobile Phone Datasets

Carlos Sarraute, Martin Minnoni

This is a brief survey of the research performed by Grandata Labs in collaboration with numerous academic groups around the world on the topic of human mobility. A driving theme in these projects is to use and improve Data Science techniques to understand mobility, as it can be observed through the lens of mobile phone datasets. We describe applications of mobility analyses for urban planning, prediction of data traffic usage, building delay tolerant networks, generating epidemiologic risk maps and measuring the predictability of human mobility.

CY | Nov 13, 2018
Comparison of Feature Extraction Methods and Predictors for Income Inference

Martin Fixman, Martin Minnoni, Carlos Sarraute

Patterns of mobile phone communications, coupled with the information of the social network graph and financial behavior, allow us to make inferences of users' socio-economic attributes such as their income level. We present here several methods to extract features from mobile phone usage (calls and messages), and compare different combinations of supervised machine learning techniques and sets of features used as input for the inference of users' income. Our experimental results show that the Bayesian method based on the communication graph outperforms standard machine learning algorithms using node-based features.

CY | Nov 10, 2018
A Bayesian Approach to Income Inference in a Communication Network

Martin Fixman, Ariel Berenstein, Jorge Brea et al.

The explosion of mobile phone communications in recent years has coincided with an exponential increase in data processing power. Together, these two global changes have opened the road to using mobile phone communications to generate inferences and characterizations of mobile phone users. In this work, we use the communication network, enriched by a set of users' attributes, to gain a better understanding of the demographic features of a population. Namely, we use call detail records and banking information to infer the income of each person in the graph.

CY | Aug 9, 2018
Uncovering the Spread of Chagas Disease in Argentina and Mexico

Juan de Monasterio, Alejo Salles, Carolina Lang et al.

Chagas disease is a neglected disease, and information about its geographical spread is very scarce. We analyze here mobility and calling patterns in order to identify potential risk zones for the disease, by using public health information and mobile phone records. Geolocalized call records are rich in social and mobility information, which can be used to infer whether an individual has lived in an endemic area. We present two case studies in Latin American countries. Our objective is to generate risk maps which can be used by public health campaign managers to prioritize detection campaigns and target specific areas. Finally, we analyze the value of mobile phone data to infer long-term migrations, which play a crucial role in the geographical spread of Chagas disease.

SI | Aug 1, 2018
Inference of Users Demographic Attributes based on Homophily in Communication Networks

Jorge Brea, Javier Burroni, Carlos Sarraute

Over the past decade, mobile phones have become prevalent in all parts of the world, across all demographic backgrounds. Mobile phones are used by men and women across a wide age range in both developed and developing countries. Consequently, they have become one of the most important mechanisms for social interaction within a population, making them an increasingly important source of information to understand human demographics and human behaviour. In this work we combine two sources of information: communication logs from a major mobile operator in a Latin American country, and information on the demographics of a subset of the user population. This allows us to perform an observational study of mobile phone usage, differentiated by age group. This study is interesting in its own right, since it provides knowledge on the structure and demographics of the mobile phone market in the studied country. We then tackle the problem of inferring the age group for all users in the network. We present here an exclusively graph-based inference method relying solely on the topological structure of the mobile network, together with a topological analysis of the performance of the algorithm. The equations for our algorithm can be described as a diffusion process with two added properties: (i) memory of its initial state, and (ii) the information is propagated as a probability vector for each node attribute (instead of the value of the attribute itself). Our algorithm can successfully infer different age groups within the network population given known values for a subset of nodes (seed nodes). Most interestingly, we show that by carefully analysing the topological relationships between correctly predicted nodes and the seed nodes, we can characterize particular subsets of nodes for which our inference method has significantly higher accuracy.
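The two properties named in the abstract (memory of the initial state, and propagation of probability vectors rather than attribute values) can be illustrated with a small sketch. The update rule, the `memory` weight and the function name `propagate_labels` are assumptions made for illustration; the abstract does not give the paper's actual equations.

```python
def propagate_labels(neighbors, seeds, n_classes, memory=0.5, n_iter=20):
    """Diffuse class probabilities over a graph.

    neighbors: dict mapping each node to a list of adjacent nodes.
    seeds: dict mapping some nodes to a known class index.
    memory: assumed weight given to a node's initial vector at every step.
    """
    uniform = [1.0 / n_classes] * n_classes
    initial = {}
    for node in neighbors:
        if node in seeds:
            vec = [0.0] * n_classes
            vec[seeds[node]] = 1.0  # seed nodes start fully certain
        else:
            vec = list(uniform)     # unknown nodes start uninformed
        initial[node] = vec

    state = {node: list(vec) for node, vec in initial.items()}
    for _ in range(n_iter):
        new_state = {}
        for node, nbrs in neighbors.items():
            if nbrs:
                mean = [sum(state[m][k] for m in nbrs) / len(nbrs)
                        for k in range(n_classes)]
            else:
                mean = state[node]
            # Blend the remembered initial state with the neighbours' average.
            new_state[node] = [memory * initial[node][k] + (1 - memory) * mean[k]
                               for k in range(n_classes)]
        state = new_state
    return state


# Toy path graph a - b - c, where only node "a" has a known age group (class 0).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = propagate_labels(graph, seeds={"a": 0}, n_classes=2)
print(result["b"], result["c"])
```

Because every update is a convex combination of probability vectors, each node's vector keeps summing to one, and the predicted group for a node is simply the index of its largest entry.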

SI | Jun 30, 2017
Prepaid or Postpaid? That is the question. Novel Methods of Subscription Type Prediction in Mobile Phone Services

Yongjun Liao, Wei Du, Márton Karsai et al.

In this paper we investigate the behavioural differences between mobile phone customers with prepaid and postpaid subscriptions. Our study reveals that (a) postpaid customers are more active in terms of service usage and (b) there are strong structural correlations in the mobile phone call network, as connections between customers of the same subscription type are much more frequent than those between customers of different subscription types. Based on these observations we provide methods to detect the subscription type of customers by using information about their personal call statistics together with their egocentric networks. The key to our first approach is to cast this classification problem as a problem of graph labelling, which can be solved by max-flow min-cut algorithms. Our experiments show that, by using both user attributes and relationships, the proposed graph labelling approach is able to achieve a classification accuracy of $\sim 87\%$, which outperforms by $\sim 7\%$ supervised learning methods using only user attributes. In our second problem we aim to infer the subscription type of customers of external operators. We propose approximate methods to solve this problem by using node attributes, and a two-way indirect inference method based on observed homophilic structural correlations. Our results have straightforward applications in behavioural prediction and personal marketing.

CR | Jul 31, 2013
An Oblivious Password Cracking Server

Aureliano Calvo, Ariel Futoransky, Carlos Sarraute

Building a password cracking server that preserves the privacy of the queries made to the server is a problem that has not yet been solved. Such a server could acquire practical relevance in the future: for instance, the tables used to crack the passwords could be calculated, stored and hosted in cloud-computing services, and could be queried from devices with limited computing power. In this paper we present a method to preserve the confidentiality of a password cracker (wherein the tables used to crack the passwords are stored by a third party) by combining Hellman tables and Private Information Retrieval (PIR) protocols. We provide the technical details of this method, analyze its complexity, and show the experimental results obtained with our implementation.
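To make the PIR ingredient concrete, here is a minimal sketch of the classic two-server XOR-based scheme, in which neither server alone learns which record the client wants. The abstract does not say which PIR protocol the paper uses, so this is an illustrative stand-in that assumes two non-colluding servers holding identical copies of the table; the function names and the toy table are hypothetical.

```python
import secrets


def make_queries(n_records, index):
    """Build two bit vectors that differ only at the wanted index."""
    q1 = [secrets.randbelow(2) for _ in range(n_records)]
    q2 = list(q1)
    q2[index] ^= 1  # each vector on its own looks uniformly random
    return q1, q2


def answer(database, query):
    """Server side: XOR together every record selected by the query bits."""
    acc = bytes(len(database[0]))
    for record, bit in zip(database, query):
        if bit:
            acc = bytes(a ^ b for a, b in zip(acc, record))
    return acc


def reconstruct(answer1, answer2):
    """Client side: all records but the wanted one cancel out under XOR."""
    return bytes(a ^ b for a, b in zip(answer1, answer2))


# Toy table of fixed-length records, standing in for rows of a lookup table.
table = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
q1, q2 = make_queries(len(table), 2)
print(reconstruct(answer(table, q1), answer(table, q2)))
```

This variant sends one bit per record to each server, so its communication cost grows linearly with the table size; practical schemes trade this off differently.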

AI | Jul 31, 2013
POMDPs Make Better Hackers: Accounting for Uncertainty in Penetration Testing

Carlos Sarraute, Olivier Buffet, Joerg Hoffmann

Penetration Testing is a methodology for assessing network security, by generating and executing possible hacking attacks. Doing so automatically allows for regular and systematic testing. A key question is how to generate the attacks. This is naturally formulated as planning under uncertainty, i.e., under incomplete knowledge about the network configuration. Previous work uses classical planning, and requires costly pre-processes reducing this uncertainty by extensive application of scanning methods. By contrast, we herein model the attack planning problem in terms of partially observable Markov decision processes (POMDP). This makes it possible to reason about the knowledge available, and to intelligently employ scanning actions as part of the attack. As one would expect, this accurate solution does not scale. We devise a method that relies on POMDPs to find good attacks on individual machines, which are then composed into an attack on the network as a whole. This decomposition exploits network structure to the extent possible, making targeted approximations (only) where needed. Evaluating this method on a suitably adapted industrial test suite, we demonstrate its effectiveness in both runtime and solution quality.

AI | Jul 30, 2013
Les POMDP font de meilleurs hackers: Tenir compte de l'incertitude dans les tests de penetration

Carlos Sarraute, Olivier Buffet, Joerg Hoffmann

Penetration Testing is a methodology for assessing network security, by generating and executing possible hacking attacks. Doing so automatically allows for regular and systematic testing. A key question is how to generate the attacks. This is naturally formulated as planning under uncertainty, i.e., under incomplete knowledge about the network configuration. Previous work uses classical planning, and requires costly pre-processes reducing this uncertainty by extensive application of scanning methods. By contrast, we herein model the attack planning problem in terms of partially observable Markov decision processes (POMDP). This makes it possible to reason about the knowledge available, and to intelligently employ scanning actions as part of the attack. As one would expect, this accurate solution does not scale. We devise a method that relies on POMDPs to find good attacks on individual machines, which are then composed into an attack on the network as a whole. This decomposition exploits network structure to the extent possible, making targeted approximations (only) where needed. Evaluating this method on a suitably adapted industrial test suite, we demonstrate its effectiveness in both runtime and solution quality.

AI | Jul 30, 2013
Automated Attack Planning

Carlos Sarraute

Penetration Testing is a methodology for assessing network security, by generating and executing possible attacks. Doing so automatically allows for regular and systematic testing. A key question then is how to automatically generate the attacks. A natural way to address this issue is as an attack planning problem. In this thesis, we are concerned with the specific context of regular automated pentesting, and use the term "attack planning" in that sense. The following three research directions are investigated. First, we introduce a conceptual model of computer network attacks, based on an analysis of the penetration testing practices. We study how this attack model can be represented in the PDDL language. Then we describe an implementation that integrates a classical planner with a penetration testing tool. This allows us to automatically generate attack paths for real world pentesting scenarios, and to validate these attacks by executing them. Secondly, we present efficient probabilistic planning algorithms, specifically designed for this problem, that achieve industrial-scale runtime performance (able to solve scenarios with several hundred hosts and exploits). These algorithms take into account the probability of success of the actions and their expected cost (for example in terms of execution time, or network traffic generated). Finally, we take a different direction: instead of trying to improve the efficiency of the solutions developed, we focus on improving the model of the attacker. We model the attack planning problem in terms of partially observable Markov decision processes (POMDP). This grounds penetration testing in a well-researched formalism. POMDPs allow the modelling of information gathering as an integral part of the problem, thus providing for the first time a means to intelligently mix scanning actions with actual exploits.

AI | Jun 19, 2013
Penetration Testing == POMDP Solving?

Carlos Sarraute, Olivier Buffet, Joerg Hoffmann

Penetration Testing is a methodology for assessing network security, by generating and executing possible attacks. Doing so automatically allows for regular and systematic testing without a prohibitive amount of human labor. A key question then is how to generate the attacks. This is naturally formulated as a planning problem. Previous work (Lucangeli et al. 2010) used classical planning and hence ignored all the incomplete knowledge that characterizes hacking. More recent work (Sarraute et al. 2011) makes strong independence assumptions for the sake of scaling, and lacks a clear formal concept of what the attack planning problem actually is. Herein, we model that problem in terms of partially observable Markov decision processes (POMDP). This grounds penetration testing in a well-researched formalism, highlighting important aspects of this problem's nature. POMDPs allow information gathering to be modelled as an integral part of the problem, thus providing for the first time a means to intelligently mix scanning actions with actual exploits.
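The way a scanning action changes the attacker's knowledge in a POMDP can be illustrated with a standard Bayesian belief update. The sketch below assumes a scan that does not alter the target's state; the configurations, observation names and probabilities are invented for illustration and do not come from the paper.

```python
def update_belief(belief, likelihood, observation):
    """Bayes update of a belief over hidden configurations after one scan result.

    belief: dict mapping configuration -> prior probability.
    likelihood: dict mapping configuration -> {observation: probability}.
    """
    unnormalized = {state: prior * likelihood[state].get(observation, 0.0)
                    for state, prior in belief.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under the current belief")
    return {state: value / total for state, value in unnormalized.items()}


# Hypothetical example: the attacker is unsure which OS a host runs.
prior = {"windows": 0.5, "linux": 0.5}
scan_model = {
    "windows": {"port_445_open": 0.9, "port_445_closed": 0.1},
    "linux": {"port_445_open": 0.2, "port_445_closed": 0.8},
}
posterior = update_belief(prior, scan_model, "port_445_open")
print(posterior)
```

A planner working over such beliefs can weigh the cost of one more scan against the risk of launching an exploit that does not match the target, which is the trade-off the abstract describes.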

CR | Jun 18, 2013
Attack Planning in the Real World

Jorge Lucangeli Obes, Carlos Sarraute, Gerardo Richarte

Assessing network security is a complex and difficult task. Attack graphs have been proposed as a tool to help network administrators understand the potential weaknesses of their network. However, one problem has not yet been addressed by previous work on this subject: how to actually execute and validate the attack paths resulting from the analysis of the attack graph. In this paper we present a complete PDDL representation of an attack model, and an implementation that integrates a planner into a penetration testing tool. This allows us to automatically generate attack paths for penetration testing scenarios, and to validate these attacks by executing the corresponding actions (including exploits) against the real target network. We present an algorithm for transforming the information present in the penetration testing tool to the planning domain, and show how the scalability issues of attack graphs can be solved using current planners. We include an analysis of the performance of our solution, showing how our model scales to medium-sized networks and the number of actions available in current penetration testing tools.

CR | Jun 17, 2013
An Algorithm to Find Optimal Attack Paths in Nondeterministic Scenarios

Carlos Sarraute, Gerardo Richarte, Jorge Lucangeli Obes

As penetration testing frameworks have evolved and become more complex, the problem of automatically controlling the pentesting tool has become an important question. This can be naturally addressed as an attack planning problem. Previous approaches to this problem were based on modeling the actions and assets in the PDDL language, and using off-the-shelf AI tools to generate attack plans. These approaches, however, are limited. In particular, the planning is classical (the actions are deterministic) and thus not able to handle the uncertainty involved in this form of attack planning. We herein contribute a planning model that does capture the uncertainty about the results of the actions, which is modeled as a probability of success of each action. We present efficient planning algorithms, specifically designed for this problem, that achieve industrial-scale runtime performance (able to solve scenarios with several hundred hosts and exploits). These algorithms take into account the probability of success of the actions and their expected cost (for example in terms of execution time, or network traffic generated). We thus show that probabilistic attack planning can be solved efficiently for the scenarios that arise when assessing the security of large networks. Two "primitives" are presented, which are used as building blocks in a framework separating the overall problem into two levels of abstraction. We also present the experimental results obtained with our implementation, and conclude with some ideas for further work.
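As a small illustration of planning with success probabilities and expected costs, consider trying several independent actions against one target, in sequence, until the first one succeeds. Under that assumption a standard exchange argument shows that ordering actions by increasing cost-to-probability ratio minimizes the expected total cost. The abstract does not spell out the paper's two primitives, so this sketch should not be read as a description of them; the action values are made up.

```python
from itertools import permutations


def expected_cost(actions):
    """Expected cost of trying (cost, success_probability) actions in order,
    stopping at the first success. Actions are assumed independent."""
    total, p_reach = 0.0, 1.0
    for cost, p_success in actions:
        total += p_reach * cost        # pay only if all earlier actions failed
        p_reach *= 1.0 - p_success
    return total


def order_by_ratio(actions):
    """Sort by cost / probability; requires every probability to be positive."""
    return sorted(actions, key=lambda a: a[0] / a[1])


# Hypothetical exploits: (expected running time in seconds, success probability).
exploits = [(10.0, 0.5), (2.0, 0.2), (5.0, 0.9)]
plan = order_by_ratio(exploits)
best = min(expected_cost(p) for p in permutations(exploits))
print(plan, expected_cost(plan), best)
```

The brute-force minimum over all orderings is included only as a sanity check on this toy instance; the sorted order avoids enumerating permutations, which matters once the number of available exploits grows.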

CR | May 21, 2013
Aplicacion de las Redes Neuronales al Reconocimiento de Sistemas Operativos

Carlos Sarraute

In this work we present a family of neural networks, the multi-layer perceptron networks, and some of the algorithms used to train those networks (we hope in enough detail and with enough precision to satisfy a mathematical audience). Then we study how to use those networks to solve a problem that arises from the field of information security: the remote identification of Operating Systems (part of the information gathering steps of the penetration testing methodology). This is the contribution of this work: it is an application of classic Artificial Intelligence techniques to a classification problem that gave better results than the classic techniques used to solve it.