Fabio Casati

HC
h-index: 62
28 papers
431 citations
Novelty: 33%
AI Score: 34

28 Papers

LG · Sep 30, 2022
Rethinking and Recomputing the Value of Machine Learning Models

Burcu Sayin, Jie Yang, Xinyue Chen et al.

In this paper, we argue that the prevailing approach to training and evaluating machine learning models often fails to consider their real-world application within organizational or societal contexts, where they are intended to create beneficial value for people. We propose a shift in perspective, redefining model assessment and selection to emphasize integration into workflows that combine machine predictions with human expertise, particularly in scenarios requiring human intervention for low-confidence predictions. Traditional metrics like accuracy and f-score fail to capture the beneficial value of models in such hybrid settings. To address this, we introduce a simple yet theoretically sound "value" metric that incorporates task-specific costs for correct predictions, errors, and rejections, offering a practical framework for real-world evaluation. Through extensive experiments, we show that existing metrics fail to capture real-world needs, often leading to suboptimal choices in terms of value when used to rank classifiers. Furthermore, we emphasize the critical role of calibration in determining model value, showing that simple, well-calibrated models can often outperform more complex models that are challenging to calibrate.
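The kind of cost-aware "value" computation described in this abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so everything here is an assumption for illustration: the function name `selective_value`, the per-item payoffs `v_correct`, `v_error`, and `v_reject`, and the rule that a prediction is rejected (handed to a human) when its confidence falls below `threshold`.

```python
from typing import Sequence


def selective_value(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
    v_correct: float = 1.0,   # assumed payoff for an accepted, correct prediction
    v_error: float = -5.0,    # assumed payoff (a cost) for an accepted, wrong prediction
    v_reject: float = -0.5,   # assumed payoff (a cost) for deferring to a human
) -> float:
    """Average value per item when low-confidence predictions are rejected."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    total = 0.0
    for conf, is_correct in zip(confidences, correct):
        if conf < threshold:
            total += v_reject      # prediction is not used; a human handles the item
        elif is_correct:
            total += v_correct     # accepted and right
        else:
            total += v_error       # accepted and wrong
    return total / len(confidences)


# Hypothetical model output: confidence scores and whether each prediction was right.
confs = [0.9, 0.6, 0.95, 0.4]
hits = [True, False, True, False]

value_with_rejection = selective_value(confs, hits, threshold=0.7)
value_without_rejection = selective_value(confs, hits, threshold=0.0)
```

With these made-up payoffs, the same model scores very differently depending on whether rejection is allowed. That is why, in a setup like this one, confidence scores that track actual correctness matter as much as raw accuracy.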

CL · Sep 28, 2025
BTC-SAM: Leveraging LLMs for Generation of Bias Test Cases for Sentiment Analysis Models

Zsolt T. Kardkovacs, Lynda Djennane, Anna Field et al.

Sentiment Analysis (SA) models harbor inherent social biases that can be harmful in real-world applications. These biases are identified by examining the output of SA models for sentences that only vary in the identity groups of the subjects. Constructing natural, linguistically rich, relevant, and diverse sets of sentences that provide sufficient coverage over the domain is expensive, especially when addressing a wide range of biases: it requires domain experts and/or crowd-sourcing. In this paper, we present a novel bias testing framework, BTC-SAM, which generates high-quality test cases for bias testing in SA models with minimal specification using Large Language Models (LLMs) for the controllable generation of test sentences. Our experiments show that relying on LLMs can provide high linguistic variation and diversity in the test sentences, thereby offering better test coverage compared to base prompting methods even for previously unseen biases.
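The detection step mentioned in this abstract, comparing model outputs for sentences that differ only in the identity group, can be sketched as follows. This is not BTC-SAM itself: the framework generates test sentences with LLMs, whereas this sketch uses a single hand-written template. The names `group_scores`, `max_gap`, and `toy_sentiment` are hypothetical, and the toy scorer is deliberately biased so the gap is visible.

```python
from typing import Callable, Dict, Sequence


def group_scores(
    template: str,
    groups: Sequence[str],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Score one sentence per identity group; only the group term varies."""
    return {group: score(template.format(group=group)) for group in groups}


def max_gap(scores: Dict[str, float]) -> float:
    """Largest difference in sentiment score across the identity groups."""
    return max(scores.values()) - min(scores.values())


def toy_sentiment(sentence: str) -> float:
    """Stand-in for a real SA model (an assumption, with a planted bias)."""
    value = 0.5
    if "great" in sentence:
        value += 0.3   # legitimate sentiment cue
    if "elderly" in sentence:
        value -= 0.2   # planted bias: an identity term shifts the score
    return value


scores = group_scores(
    "The {group} engineer did a great job.",
    ["young", "elderly"],
    toy_sentiment,
)
gap = max_gap(scores)  # a non-zero gap flags a possible bias for this template
```

A single template gives very narrow coverage, which is the limitation the abstract raises about hand-built test sets.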

AI · May 29, 2023
ProcessGPT: Transforming Business Process Management with Generative Artificial Intelligence

Amin Beheshti, Jian Yang, Quan Z. Sheng et al.

Generative Pre-trained Transformer (GPT) is a state-of-the-art machine learning model capable of generating human-like text through natural language processing (NLP). GPT is trained on massive amounts of text data and uses deep learning techniques to learn patterns and relationships within the data, enabling it to generate coherent and contextually appropriate text. This position paper proposes using GPT technology to generate new process models when/if needed. We introduce ProcessGPT as a new technology that has the potential to enhance decision-making in data-centric and knowledge-intensive processes. ProcessGPT can be designed by training a generative pre-trained transformer model on a large dataset of business process data. This model can then be fine-tuned on specific process domains and trained to generate process flows and make decisions based on context and user input. The model can be integrated with NLP and machine learning techniques to provide insights and recommendations for process improvement. Furthermore, the model can automate repetitive tasks and improve process efficiency while enabling knowledge workers to communicate analysis findings, supporting evidence, and make decisions. ProcessGPT can revolutionize business process management (BPM) by offering a powerful tool for process augmentation, automation and improvement. Finally, we demonstrate how ProcessGPT can be a powerful tool for augmenting data engineers in maintaining data ecosystem processes within large bank organizations. Our scenario highlights the potential of this approach to improve efficiency, reduce costs, and enhance the quality of business operations through the automation of data-centric and knowledge-intensive processes. These results underscore the promise of ProcessGPT as a transformative technology for organizations looking to improve their process workflows.

LG · Dec 13, 2021
On the Value of ML Models

Fabio Casati, Pierre-André Noël, Jie Yang

We argue that, when establishing and benchmarking Machine Learning (ML) models, the research community should favour evaluation metrics that better capture the value delivered by their model in practical applications. For a specific class of use cases -- selective classification -- we show that not only can it be simple enough to do, but that it has important consequences and provides insights into what to look for in a "good" ML model.

CL · Sep 20, 2021
Crowdsourcing Diverse Paraphrases for Training Task-oriented Bots

Jorge Ramírez, Auday Berro, Marcos Baez et al.

A prominent approach to building datasets for training task-oriented bots is crowd-based paraphrasing. Current approaches, however, assume the crowd would naturally provide diverse paraphrases or focus only on lexical diversity. In this WiP we addressed an overlooked aspect of diversity, introducing an approach for guiding the crowdsourcing process towards paraphrases that are syntactically diverse.

HC · Jul 28, 2021
On the state of reporting in crowdsourcing experiments and a checklist to aid current practices

Jorge Ramírez, Burcu Sayin, Marcos Baez et al.

Crowdsourcing is being increasingly adopted as a platform to run studies with human subjects. Running a crowdsourcing experiment involves several choices and strategies to successfully port an experimental design into an otherwise uncontrolled research environment, e.g., sampling crowd workers, mapping experimental conditions to micro-tasks, or ensuring quality contributions. While several guidelines inform researchers in these choices, guidance on how and what to report from crowdsourcing experiments has been largely overlooked. If under-reported, implementation choices constitute variability sources that can affect the experiment's reproducibility and prevent a fair assessment of research outcomes. In this paper, we examine the current state of reporting of crowdsourcing experiments and offer guidance to address associated reporting issues. We start by identifying sensible implementation choices, relying on existing literature and interviews with experts, to then extensively analyze the reporting of 171 crowdsourcing experiments. Informed by this process, we propose a checklist for reporting crowdsourcing experiments.

LG · Jan 21, 2021
Active Hybrid Classification

Evgeny Krivosheev, Fabio Casati, Alessandro Bozzon

Hybrid crowd-machine classifiers can achieve superior performance by combining the cost-effectiveness of automatic classification with the accuracy of human judgment. This paper shows how crowd and machines can support each other in tackling classification problems. Specifically, we propose an architecture that orchestrates active learning and crowd classification and combines them in a virtuous cycle. We show that when the pool of items to classify is finite we face a learning vs. exploitation trade-off in hybrid classification, as we need to balance crowd tasks optimized for creating a training dataset with tasks optimized for classifying items in the pool. We define the problem, propose a set of heuristics and evaluate the approach on three real-world datasets with different characteristics in terms of machine and crowd classification performance, showing that our active hybrid approach significantly outperforms baselines.

HC · Nov 8, 2020
Chatbots as conversational healthcare services

Mlađan Jovanović, Marcos Baez, Fabio Casati

Chatbots are emerging as a promising platform for accessing and delivering healthcare services. The evidence is in the growing number of publicly available chatbots aiming at taking an active role in the provision of prevention, diagnosis, and treatment services. This article takes a closer look at how these emerging chatbots address design aspects relevant to healthcare service provision, emphasizing the Human-AI interaction aspects and the transparency in AI automation and decision making.

HC · Nov 5, 2020
On the impact of predicate complexity in crowdsourced classification tasks

Jorge Ramírez, Marcos Baez, Fabio Casati et al.

This paper explores and offers guidance on a specific and relevant problem in task design for crowdsourcing: how to formulate a complex question used to classify a set of items. In micro-task markets, classification is still among the most popular tasks. We situate our work in the context of information retrieval and multi-predicate classification, i.e., classifying a set of items based on a set of conditions. Our experiments cover a wide range of tasks and domains, and also consider crowd workers alone and in tandem with machine learning classifiers. We provide empirical evidence into how the resulting classification performance is affected by different predicate formulation strategies, emphasizing the importance of predicate formulation as a task design dimension in crowdsourcing.

HC · Nov 5, 2020
Challenges and strategies for running controlled crowdsourcing experiments

Jorge Ramírez, Marcos Baez, Fabio Casati et al.

This paper reports on the challenges and lessons we learned while running controlled experiments in crowdsourcing platforms. Crowdsourcing is becoming an attractive technique to engage a diverse and large pool of subjects in experimental research, allowing researchers to achieve levels of scale and completion times that would otherwise not be feasible in lab settings. However, the scale and flexibility come at the cost of multiple and sometimes unknown sources of bias and confounding factors that arise from technical limitations of crowdsourcing platforms and from the challenges of running controlled experiments in the "wild". In this paper, we take our experience in running systematic evaluations of task design as a motivating example to explore, describe, and quantify the potential impact of running uncontrolled crowdsourcing experiments and derive possible coping strategies. Among the challenges identified, we can mention sampling bias, controlling the assignment of subjects to experimental conditions, learning effects, and reliability of crowdsourcing results. According to our empirical studies, the impact of potential biases and confounding factors can amount to a 38% loss in the utility of the data collected in uncontrolled settings; and it can significantly change the outcome of experiments. These issues ultimately inspired us to implement CrowdHub, a system that sits on top of major crowdsourcing platforms and allows researchers and practitioners to run controlled crowdsourcing projects.

SE · Sep 7, 2020
Chatbot integration in few patterns

Marcos Baez, Florian Daniel, Fabio Casati et al.

Chatbots are software agents that are able to interact with humans in natural language. Their intuitive interaction paradigm is expected to significantly reshape the software landscape of tomorrow, while already today chatbots are invading a multitude of scenarios and contexts. This article takes a developer's perspective, identifies a set of architectural patterns that capture different chatbot integration scenarios, and reviews state-of-the-art development aids.

DB · Jan 17, 2020
Siamese Graph Neural Networks for Data Integration

Evgeny Krivosheev, Mattia Atzeni, Katsiaryna Mirylenka et al.

Data integration has been studied extensively for decades and approached from different angles. However, this domain still remains largely rule-driven and lacks universal automation. Recent development in machine learning and in particular deep learning has opened the way to more general and more efficient solutions to data integration problems. In this work, we propose a general approach to modeling and integrating entities from structured data, such as relational databases, as well as unstructured sources, such as free text from news articles. Our approach is designed to explicitly model and leverage relations between entities, thereby using all available information and preserving as much context as possible. This is achieved by combining siamese and graph neural networks to propagate information between connected entities and support high scalability. We evaluate our method on the task of integrating data about business entities, and we demonstrate that it outperforms standard rule-based systems, as well as other deep learning approaches that do not use graph-based representations.

HC · Sep 6, 2019
CrowdHub: Extending crowdsourcing platforms for the controlled evaluation of tasks designs

Jorge Ramírez, Simone Degiacomi, Davide Zanella et al.

We present CrowdHub, a tool for running systematic evaluations of task designs on top of crowdsourcing platforms. The goal is to support the evaluation process, avoiding potential experimental biases that, according to our empirical studies, can amount to a 38% loss in the utility of the collected dataset in uncontrolled settings. Using CrowdHub, researchers can map their experimental design and automate the complex process of managing task execution over time while controlling for returning workers and crowd demographics, thus reducing bias, increasing utility of collected data, and making more efficient use of a limited pool of subjects.

HC · Sep 6, 2019
Understanding the Impact of Text Highlighting in Crowdsourcing Tasks

Jorge Ramírez, Marcos Baez, Fabio Casati et al.

Text classification is one of the most common goals of machine learning (ML) projects, and also one of the most frequent human intelligence tasks in crowdsourcing platforms. ML has mixed success in such tasks depending on the nature of the problem, while crowd-based classification has proven to be surprisingly effective, but can be expensive. Recently, hybrid text classification algorithms, combining human computation and machine learning, have been proposed to improve accuracy and reduce costs. One way to do so is to have ML highlight or emphasize portions of text that it believes to be more relevant to the decision. Humans can then rely only on this text or read the entire text if the highlighted information is insufficient. In this paper, we investigate if and under what conditions highlighting selected parts of the text can (or cannot) improve classification cost and/or accuracy, and in general how it affects the process and outcome of the human intelligence tasks. We study this through a series of crowdsourcing experiments running over different datasets and with task designs imposing different cognitive demands. Our findings suggest that highlighting is effective in reducing classification effort but does not improve accuracy - and in fact, low-quality highlighting can decrease it.

IR · Apr 1, 2019
Combining Crowd and Machines for Multi-predicate Item Screening

Evgeny Krivosheev, Fabio Casati, Marcos Baez et al.

This paper discusses how crowd and machine classifiers can be efficiently combined to screen items that satisfy a set of predicates. We show that this is a recurring problem in many domains, present machine-human (hybrid) algorithms that screen items efficiently and estimate the gain over human-only or machine-only screening in terms of performance and cost. We further show how, given a new classification problem and a set of classifiers of unknown accuracy for the problem at hand, we can identify how to manage the cost-accuracy trade-off by progressively determining if we should spend budget to obtain test data (to assess the accuracy of the given classifiers), or to train an ensemble of classifiers, or whether we should leverage the existing machine classifiers with the crowd, and in this case how to efficiently combine them based on their estimated characteristics to obtain the classification. We demonstrate that the techniques we propose obtain significant cost/accuracy improvements with respect to the leading classification algorithms.

HC · Jan 14, 2019
Technologies for promoting social participation in later life

Marcos Baez, Radoslaw Nielek, Fabio Casati et al.

Social participation is known to bring great benefits to the health and well-being of people as they age. From being in contact with others to engaging in group activities, keeping socially active can help slow down the effects of age-related declines, reduce risks of loneliness and social isolation and even mortality in old age. There are unfortunately a variety of barriers that make it difficult for older adults to engage in social activities on a regular basis. In this chapter, we give an overview of the challenges to social participation and discuss how technology can help overcome these barriers and promote participation in social activities. We examine two particular research threads and designs, exploring ways in which technology can support co-located and virtual participation: i) an application that motivates the virtual participation in group training programs, and ii) a location-based game that supports co-located intergenerational ICT training classes. We discuss the effectiveness and limitations of various design choices in the two use cases and outline the lessons learned.

CY · May 31, 2018
Designing for Co-located and Virtual Social Interactions in Residential Care

Francisco Ibarra, Marcos Baez, Francesca Fiore et al.

In this paper we explore the feasibility and design challenges in supporting co-located and virtual social interactions in residential care by building on the practice of reminiscence. Motivated by the challenges of social interaction in this context, we first explore the feasibility of a reminiscence-based social interaction tool designed to stimulate conversation in residential care with different stakeholders. Then, we explore the design challenges in supporting an assisting role in co-located reminiscence sessions, by running pilot studies with a technology probe. Our findings point to the feasibility of the tool and the willingness of stakeholders to contribute in the process, although with some skepticism about virtual interactions. The reminiscence sessions showed that compromises are needed when designing for both story collection and conversation stimulation, evidencing specific design areas where further exploration is needed.

HC · May 31, 2018
CrowdRev: A platform for Crowd-based Screening of Literature Reviews

Jorge Ramirez, Evgeny Krivosheev, Marcos Baez et al.

In this paper and demo we present a crowd and crowd+AI based system, called CrowdRev, supporting the screening phase of literature reviews and achieving the same quality as author classification at a fraction of the cost, and near-instantly. CrowdRev makes it easy for authors to leverage the crowd, and ensures that no money is wasted even in the face of difficult papers or criteria: if the system detects that the task is too hard for the crowd, it just gives up trying (for that paper, or for that criteria, or altogether), without wasting money and never compromising on quality.

HC · May 31, 2018
Crowdsourcing for Reminiscence Chatbot Design

Svetlana Nikitina, Florian Daniel, Marcos Baez et al.

In this work-in-progress paper we discuss the challenges in identifying effective and scalable crowd-based strategies for designing content, conversation logic, and meaningful metrics for a reminiscence chatbot targeted at older adults. We formalize the problem and outline the main research questions that drive the research agenda in chatbot design for reminiscence and for relational agents for older adults in general.

HC · Mar 21, 2018
Crowd-Machine Collaboration for Item Screening

Evgeny Krivosheev, Bahareh Harandizadeh, Fabio Casati et al.

In this paper we describe how crowd and machine classifiers can be efficiently combined to screen items that satisfy a set of predicates. We show that this is a recurring problem in many domains, present machine-human (hybrid) algorithms that screen items efficiently and estimate the gain over human-only or machine-only screening in terms of performance and cost.

HC · Mar 21, 2018
Crowd-based Multi-Predicate Screening of Papers in Literature Reviews

Evgeny Krivosheev, Fabio Casati, Boualem Benatallah

Systematic literature reviews (SLRs) are one of the most common and useful forms of scientific research and publication. Tens of thousands of SLRs are published each year, and this rate is growing across all fields of science. Performing an accurate, complete and unbiased SLR is however a difficult and expensive endeavor. This is true in general for all phases of a literature review, and in particular for the paper screening phase, where authors filter a set of potentially in-scope papers based on a number of exclusion criteria. To address the problem, in recent years the research community has begun to explore the use of the crowd to allow for a faster, accurate, cheaper and unbiased screening of papers. Initial results show that crowdsourcing can be effective, even for relatively complex reviews. In this paper we derive and analyze a set of strategies for crowd-based screening, and show that an adaptive strategy, that continuously re-assesses the statistical properties of the problem to minimize the number of votes needed to take decisions for each paper, significantly outperforms a number of non-adaptive approaches in terms of cost and accuracy. We validate both applicability and results of the approach through a set of crowdsourcing experiments, and discuss properties of the problem and algorithms that we believe to be generally of interest for classification problems where items are classified via a series of successive tests (as it often happens in medicine).

SE · Nov 15, 2017
Programming Bots by Synthesizing Natural Language Expressions into API Invocations

Shayan Zamanirad, Boualem Benatallah, Moshe Chai Barukh et al.

At present, bots are still in their preliminary stages of development. Many are relatively simple, or developed ad-hoc for a very specific use-case. For this reason, they are typically programmed manually, or utilize machine-learning classifiers to interpret a fixed set of user utterances. In reality, real world conversations with humans require support for dynamically capturing users' expressions. Moreover, bots will derive immeasurable value by programming them to invoke APIs for their results. Today, within the Web and Mobile development community, complex applications are being strung together with a few lines of code -- all made possible by APIs. Yet, developers today are not as empowered to program bots in much the same way. To overcome this, we introduce BotBase, a bot programming platform that dynamically synthesizes natural language user expressions into API invocations. Our solution is two-faceted: firstly, we construct an API knowledge graph to encode and evolve APIs; secondly, leveraging the above we apply techniques in NLP, ML and Entity Recognition to perform the required synthesis from natural language user expressions into API calls.

IR · Sep 15, 2017
Crowdsourcing Paper Screening in Systematic Literature Reviews

Evgeny Krivosheev, Fabio Casati, Valentina Caforio et al.

Literature reviews allow scientists to stand on the shoulders of giants, showing promising directions, summarizing progress, and pointing out existing challenges in research. At the same time conducting a systematic literature review is a laborious and consequently expensive process. In the last decade, there have been a few studies on crowdsourcing in literature reviews. This paper explores the feasibility of crowdsourcing for facilitating the literature review process in terms of results, time and effort, as well as to identify which crowdsourcing strategies provide the best results based on the budget available. In particular we focus on the screening phase of the literature review process and we contribute and assess methods for identifying the size of tests, labels required per paper, and classification functions as well as methods to split the crowdsourcing process in phases to improve results. Finally, we present our findings based on experiments run on Crowdflower.

HC · Mar 18, 2017
Designing for older adults: review of touchscreen design guidelines

Leysan Nurgalieva, Juan Jose Jara Laconich, Marcos Baez et al.

The distinct abilities of older adults to interact with computers have motivated a wide range of contributions in the form of design guidelines for making technologies usable and accessible for the elderly population. However, despite the growing effort by the research community, the adoption of guidelines by developers and designers has been scant or not properly translated into more accessible interaction systems. In this paper we explore this issue by reporting on the qualitative outcomes of a systematic review of 204 research-derived design guidelines for touchscreen applications. We report first on the different definitions of "elderly" and assess the reliability, organization and accessibility of the guidelines. Then we present our early attempt at facilitating the reporting and access of such guidelines to researchers and practitioners.

HC · Sep 17, 2016
Online Group-exercises for Older Adults of Different Physical Abilities

Marcos Baez, Francisco Ibarra, Iman Khaghani Far et al.

In this paper we describe the design and validation of a virtual fitness environment aiming at keeping older adults physically and socially active. We target particularly older adults who are socially more isolated, physically less active, and with fewer chances of training in a gym. The virtual fitness environment, namely Gymcentral, was designed to enable and motivate older adults to follow personalised exercises from home, with a (heterogeneous) group of remote friends and under the remote supervision of a Coach. We take the training activity as an opportunity to create social interactions, by complementing training features with social instruments. Finally, we report on the feasibility and effectiveness of the virtual environment, as well as its effects on the usage and social interactions, from an intervention study in Trento, Italy.

HC · Jul 6, 2016
CrowdCafe - Mobile Crowdsourcing Platform

Pavel Kucherbaev, Azad Abad, Stefano Tranquillini et al.

In this paper we present a mobile crowdsourcing platform CrowdCafe, where people can perform microtasks using their smartphones while they ride a bus, travel by train, stand in a queue or wait for an appointment. These microtasks are executed in exchange for rewards provided by local stores, such as coffee, desserts and bus tickets. We present the concept, the implementation and the evaluation by conducting a study with 52 participants, having 1108 tasks completed.

CY · Mar 9, 2016
Personalized Persuasion for Social Interactions in Nursing Homes

Marcos Baez, Chiara Dalpiaz, Fatbardha Hoxha et al.

This paper presents our preliminary investigation and approach towards a mixed physical-virtual technology for stimulating social interactions among and with older adults in nursing homes. We report on a set of surveys, apps and focus groups aiming at understanding the different motivations and obstacles in promoting social interactions in institutionalised care. We then present our approach to address some of the key themes found, e.g., the technological disparity, lack of conversation topics and opportunities to interact.