José Camacho

7 papers

64 citations

Novelty 24%

AI Score 30

Ranked #144,812 of 201,326 authors (top 72%); #4,046 in CR (top 55%)

7 Papers

ME · Nov 24, 2025

A Set of Rules for Model Validation

José Camacho

The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.

LG · May 31, 2023

Quality In / Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR'16

José Camacho, Katarzyna Wasielewska, Pablo Espinosa et al.

Autonomous or self-driving networks are expected to provide a solution to the myriad of extremely demanding new applications with minimal human supervision. For this purpose, the community relies on the development of new Machine Learning (ML) models and techniques. However, ML can only be as good as the data it is fitted with, and data quality is an elusive concept that is difficult to assess. In this paper, we show that relatively minor modifications to a benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) have significantly more impact on model performance than the specific ML technique considered. We also show that the measured model performance is uncertain, as a result of labelling inaccuracies. Our findings illustrate that the widely adopted approach of comparing a set of models in terms of performance results (e.g., in terms of accuracy or ROC curves) may lead to incorrect conclusions when done without a proper understanding of dataset biases and sensitivity. We contribute a methodology to interpret a model response that can be useful for this understanding.

CR · Jul 31, 2019

MSNM-Sensor: An Applied Network Monitoring Tool for Anomaly Detection in Complex Networks and Systems

Roberto Magán-Carrión, José Camacho, Gabriel Maciá-Fernández et al.

Technology evolves quickly. Low-cost and ready-to-connect devices are designed to provide new services and applications. Smart grids or smart healthcare systems are some examples of these applications, all of which are in the context of smart cities. In this total-connectivity scenario, some security issues arise, since the larger the number of connected devices is, the larger the attack surface. Consequently, new solutions for monitoring and detecting security events are needed to address the new challenges brought about by this scenario, among others, the large number of devices to monitor, the large amount of data to manage and the real-time requirement to provide quick security event detection and, consequently, quick response to attacks. In this work, a practical and ready-to-use tool for monitoring and detecting security events in these environments is developed and introduced. The tool is based on the Multivariate Statistical Network Monitoring (MSNM) methodology for monitoring and anomaly detection, and we call it MSNM-Sensor. Although it is in its early development stages, experimental results based on the detection of well-known attacks in hierarchical network systems prove the suitability of this tool for more complex scenarios, such as those found in smart cities or IoT ecosystems.

NI · Jul 5, 2019

Interpretable Feature Learning in Multivariate Big Data Analysis for Network Monitoring

José Camacho, Katarzyna Wasielewska, Rasmus Bro et al.

There is an increasing interest in the development of new data-driven models useful to assess the performance of communication networks. For many applications, like network monitoring and troubleshooting, a data model is of little use if it cannot be interpreted by a human operator. In this paper, we present an extension of the Multivariate Big Data Analysis (MBDA) methodology, a recently proposed interpretable data analysis tool. In this extension, we propose a solution to the automatic derivation of features, a cornerstone step for the application of MBDA when the amount of data is massive. The resulting network monitoring approach allows us to detect and diagnose disparate network anomalies, with a data-analysis workflow that combines the advantages of interpretable and interactive models with the power of parallel processing. We apply the extended MBDA to two case studies: UGR'16, a benchmark flow-based real-traffic dataset for anomaly detection, and Dartmouth'18, the longest and largest Wi-Fi trace known to date.

ML · Jun 28, 2019

Cross-product Penalized Component Analysis (XCAN)

José Camacho, Evrim Acar, Morten A. Rasmussen et al.

Matrix factorization methods are extensively employed to understand complex data. In this paper, we introduce the cross-product penalized component analysis (XCAN), a sparse matrix factorization based on the optimization of a loss function that allows a trade-off between variance maximization and structural preservation. The approach is based on previous developments, notably (i) the Sparse Principal Component Analysis (SPCA) framework based on the LASSO, (ii) extensions of SPCA to constrain both modes of the factorization, like co-clustering or the Penalized Matrix Decomposition (PMD), and (iii) the Group-wise Principal Component Analysis (GPCA) method. The result is a flexible modeling approach that can be used for data exploration in a large variety of problems. We demonstrate its use with applications from different disciplines.

NI · Jun 27, 2019

Multivariate Big Data Analysis for Intrusion Detection: 5 steps from the haystack to the needle

José Camacho, José Manuel García-Giménez, Noemí Marta Fuentes-García et al.

The research literature on cybersecurity incident detection & response is very rich in automatic detection methodologies, in particular those based on the anomaly detection paradigm. However, very little attention has been devoted to the diagnosis ability of the methods, that is, their ability to provide useful information on the causes of a given detected anomaly. This information is of utmost importance for the security team to reduce the time from detection to response. In this paper, we present Multivariate Big Data Analysis (MBDA), a complete intrusion detection approach based on 5 steps to effectively handle massive amounts of disparate data sources. The approach has been designed to deal with the main characteristics of Big Data, that is, the high volume, velocity and variety. The core of the approach is the Multivariate Statistical Network Monitoring (MSNM) technique proposed in a recent paper. Unlike in state-of-the-art machine learning methodologies applied to the intrusion detection problem, when an anomaly is identified in MBDA the output of the system includes the detail of the logs of raw information associated with this anomaly, so that the security team can use this information to elucidate its root causes. MBDA is based on two open software packages available on GitHub: the MEDA Toolbox and the FCParser. We illustrate our approach with two case studies. The first one demonstrates the application of MBDA to semistructured sources of information, using the data from the VAST 2012 mini challenge 2. This complete case study is supplied in a virtual machine available for download. In the second case study we show the Big Data capabilities of the approach on data collected from a real network with labeled attacks.

CR · Jun 6, 2017

On the Feasibility of Distinguishing Between Process Disturbances and Intrusions in Process Control Systems Using Multivariate Statistical Process Control

Mikel Iturbe, José Camacho, Iñaki Garitano et al.

Process Control Systems (PCSs) are the operating core of Critical Infrastructures (CIs). As such, anomaly detection has been an active research field to ensure normal CI operation. Previous approaches have leveraged network level data for anomaly detection, or have disregarded the existence of process disturbances, thus opening the possibility of mislabelling disturbances as attacks and vice versa. In this paper we present an anomaly detection and diagnostic system based on Multivariate Statistical Process Control (MSPC) that aims to distinguish between attacks and disturbances. To this end, we expand traditional MSPC to monitor process level and controller level data. We evaluate our approach using the Tennessee-Eastman process. Results show that our approach can be used to distinguish disturbances from intrusions to a certain extent and we conclude that the proposed approach can be extended with other sources of data for improving results.
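
For readers unfamiliar with MSPC, the following is a minimal sketch of generic PCA-based monitoring, which is how MSPC is commonly formulated. It is not the system described in the abstract above. The function names (`fit_pca_monitor`, `monitor`), the synthetic data, and the use of empirical 99th-percentile control limits are all assumptions made for illustration. The sketch computes two statistics per observation: a D-statistic (Hotelling's T², distance inside the PCA subspace) and a Q-statistic (squared residual outside it).

```python
import numpy as np

def fit_pca_monitor(X_cal, n_components):
    """Fit a PCA monitoring model on calibration (normal-operation) data."""
    mean = X_cal.mean(axis=0)
    std = X_cal.std(axis=0, ddof=1)
    std = np.where(std == 0, 1.0, std)       # guard against constant variables
    Z = (X_cal - mean) / std                 # auto-scale the data
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                  # loadings (variables x components)
    score_var = s[:n_components] ** 2 / (Z.shape[0] - 1)  # variance of each score
    return {"mean": mean, "std": std, "P": P, "score_var": score_var}

def monitor(model, X):
    """Return the D-statistic and Q-statistic for each row of X."""
    Z = (X - model["mean"]) / model["std"]
    T = Z @ model["P"]                       # scores in the PCA subspace
    d_stat = np.sum(T ** 2 / model["score_var"], axis=1)
    residual = Z - T @ model["P"].T          # part not explained by the model
    q_stat = np.sum(residual ** 2, axis=1)
    return d_stat, q_stat

# Synthetic calibration data (hypothetical): 5 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
W = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]])
X_cal = latent @ W + 0.05 * rng.normal(size=(500, 5))

model = fit_pca_monitor(X_cal, n_components=2)
d_cal, q_cal = monitor(model, X_cal)

# Empirical control limits (an assumption; theoretical limits are also common).
d_lim = np.percentile(d_cal, 99)
q_lim = np.percentile(q_cal, 99)

# A new observation where only variable 0 is shifted, breaking its
# correlation with variable 1; this mainly inflates the Q-statistic.
x_new = X_cal[:1].copy()
x_new[0, 0] += 10 * model["std"][0]
d_new, q_new = monitor(model, x_new)
print("Q exceeds limit:", bool(q_new[0] > q_lim))
```

In this kind of scheme, the pattern of which statistic fires and which variables contribute to it is what supports diagnosis. Distinguishing disturbances from intrusions, as the abstract describes, would need additional data sources and logic beyond this sketch.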