GEO-PH Aug 20, 2022
Data Centred Intelligent Geosciences: Research Agenda and Opportunities, Position Paper
Aderson Farias do Nascimento, Martin A. Musicante, Umberto Souza da Costa et al.
This paper describes and discusses our vision to develop and reason about best practices and novel ways of curating data-centric geosciences knowledge (data, experiments, models, methods, conclusions, and interpretations). This knowledge is produced by applying statistical modelling, Machine Learning, and modern data analytics methods to geo-data collections. The problems address open methodological questions in model building, model assessment, prediction, and forecasting workflows.
HC Nov 12, 2023
Conversational Data Exploration: A Game-Changer for Designing Data Science Pipelines
Genoveva Vargas-Solar, Tania Cerquitelli, Javier A. Espinosa-Oviedo et al.
This paper proposes a conversational approach implemented by the system Chatin for driving an intuitive data exploration experience. Our work aims to unlock the full potential of data analytics and artificial intelligence with a new generation of data science solutions. Chatin is a cutting-edge tool that democratises access to AI-driven solutions, empowering non-technical users from various disciplines to explore data and extract knowledge from it.
DB Oct 20, 2021
QoS-based Trust Evaluation for Data Services as a Black Box
Senda Romdhani, Genoveva Vargas-Solar, Nadia Bennani et al.
This paper proposes a QoS-based trust evaluation model for black box data services. Under the black-box model, data services export neither (meta)-data about the conditions in which they are deployed and in which they collect and process data, nor the quality of the data they deliver. Therefore, the black-box model creates blind spots about the extent to which data providers can be trusted when building target applications. The trust evaluation model for black box data services introduced in this paper originally combines QoS indicators, like service performance and data quality, to determine service trustworthiness. The paper also introduces DETECT: a Data sErvice as a black box Trust Evaluation arChitecTure, that validates our model. The trust model and its associated monitoring strategies have been assessed in experiments with representative case studies. The results demonstrate the feasibility and effectiveness of our solution.
SE Aug 5, 2021
TRANSMUT-SPARK: Transformation Mutation for Apache Spark
Joao Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar et al.
We propose TRANSMUT-Spark, a tool that automates the mutation testing process of Big Data processing code within Spark programs. Apache Spark is an engine for Big Data processing. It hides the complexity inherent to Big Data parallel and distributed programming and processing through built-in functions, underlying parallel processes, and data management strategies. Nonetheless, programmers must cleverly combine these functions within programs and guide the engine to use the right data management strategies to exploit the large number of computational resources required by Big Data processing and avoid substantial production losses. Many programming details in data processing code within Spark programs are error-prone, so this code needs to be tested correctly and automatically. This paper explores the application to Spark programs of mutation testing, a fault-based testing technique that relies on fault simulation to evaluate and design test sets. The paper introduces the TRANSMUT-Spark solution for testing Spark programs. TRANSMUT-Spark automates the most laborious steps of the process and fully executes the mutation testing process. The paper describes how the tool automates the mutant generation, test execution, and adequacy analysis phases of mutation testing. It also discusses the results of experiments that were carried out to validate the tool and to discuss its scope and limitations.
SE Aug 5, 2021
An Abstract View of Big Data Processing Programs
Joao Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar et al.
This paper proposes a model for specifying data flow-based parallel data processing programs agnostic of target Big Data processing frameworks. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big Data processing programs in two levels: a high level representing the program data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model for data processing programs proposed in [1] to enable the use of iterative programs. The general specification of iterative data processing programs implemented by data flow-based parallel programming models is essential given the democratization of iterative and greedy Big Data analytics algorithms. Indeed, these algorithms call for revisiting parallel programming models to express iterations. The paper gives a comparative analysis of the iteration strategies proposed by Apache Spark, DryadLINQ, Apache Beam and Apache Flink. It discusses how the model generalizes these strategies.
CL May 3, 2021
Looking for COVID-19 misinformation in multilingual social media texts
Raj Ratn Pranesh, Mehrdad Farokhnejad, Ambesh Shekhar et al.
This paper presents the Multilingual COVID-19 Analysis Method (CMTA) for detecting and observing the spread of misinformation about this disease within texts. CMTA proposes a data science (DS) pipeline that applies machine learning models for processing, classifying (Dense-CNN) and analyzing (MBERT) multilingual (micro)-texts. The DS pipeline's data preparation tasks extract features from multilingual textual data and categorize it into specific information classes (i.e., 'false', 'partly false', 'misleading'). The CMTA pipeline has been tested on multilingual micro-texts (tweets), showing misinformation spread across different languages. To assess the performance of CMTA and put it in perspective, we performed a comparative analysis of CMTA with eight monolingual models used for detecting misinformation. The comparison shows that CMTA has surpassed various monolingual models and suggests that it can be used as a general method for detecting misinformation in multilingual micro-texts. CMTA experimental results show misinformation trends about COVID-19 in different languages during the first pandemic months.