Luis de-Marcos

LG
4papers
118citations
Novelty34%
AI Score36

4 Papers

CLMar 6
Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models

Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz

Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.

CRDec 9, 2020
Interconnection between darknets

Carlos Cilleruelo, Luis de-Marcos, Javier Junquera-Sánchez et al.

Tor and i2p networks are two of the most popular darknets. Both darknets have become an area of illegal activities highlighting the necessity to study and analyze them to identify and report illegal content to Law Enforcement Agencies (LEAs). This paper analyzes the connections between the Tor network and the i2p network. We created the first dataset that combines information from Tor and i2p networks. The dataset contains more than 49k darknet services. The process of building and analyzing the dataset shows that it is not possible to explore one of the networks without considering the other. Both networks work as an ecosystem and there are clear paths between them. Using graph analysis, we also identified the most relevant domains, the prominent types of services in each network, and their relations. Findings are relevant to LEAs and researchers aiming to crawl and investigate i2p and Tor networks.

LGJan 31, 2019
Distributed Correlation-Based Feature Selection in Spark

Raul-Jose Palma-Mendoza, Luis de-Marcos, Daniel Rodriguez et al.

CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster computing model, currently gaining popularity due to its much faster processing times than Hadoop's MapReduce model. We tested our algorithms on four publicly available datasets, each consisting of a large number of instances and two also consisting of a large number of features. The results show that our algorithms were superior in terms of both time-efficiency and scalability. In leveraging a computer cluster, they were able to handle larger datasets than the non-distributed WEKA version while maintaining the quality of the results, i.e., exactly the same features were returned by our algorithms when compared to the original algorithm available in WEKA.

LGNov 1, 2018
Distributed ReliefF based Feature Selection in Spark

Raul-Jose Palma-Mendoza, Daniel Rodriguez, Luis de-Marcos

Feature selection (FS) is a key research area in the machine learning and data mining fields, removing irrelevant and redundant features usually helps to reduce the effort required to process a dataset while maintaining or even improving the processing algorithm's accuracy. However, traditional algorithms designed for executing on a single machine lack scalability to deal with the increasing amount of data that has become available in the current Big Data era. ReliefF is one of the most important algorithms successfully implemented in many FS applications. In this paper, we present a completely redesigned distributed version of the popular ReliefF algorithm based on the novel Spark cluster computing model that we have called DiReliefF. Spark is increasing its popularity due to its much faster processing times compared with Hadoop's MapReduce model implementation. The effectiveness of our proposal is tested on four publicly available datasets, all of them with a large number of instances and two of them with also a large number of features. Subsets of these datasets were also used to compare the results to a non-distributed implementation of the algorithm. The results show that the non-distributed implementation is unable to handle such large volumes of data without specialized hardware, while our design can process them in a scalable way with much better processing times and memory usage.