Carlos Castillo

h-index10

49papers

2,822citations

Novelty36%

AI Score54

Ranked #27,595 of 201,326 authors (top 14%)#227 in IR (top 10%)

49 Papers

LGJun 1

Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment

Ashwin Singh, Carlos Castillo

Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity, by first deriving a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then by evaluating the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness.

IRApr 21, 2022

Cross-Lingual Query-Based Summarization of Crisis-Related Social Media: An Abstractive Approach Using Transformers

Fedor Vitiugin, Carlos Castillo

Relevant and timely information collected from social media during crises can be an invaluable resource for emergency management. However, extracting this information remains a challenging task, particularly when dealing with social media postings in multiple languages. This work proposes a cross-lingual method for retrieving and summarizing crisis-relevant information from social media postings. We describe a uniform way of expressing various information needs through structured queries and a way of creating summaries answering those information needs. The method is based on multilingual transformers embeddings. Queries are written in one of the languages supported by the embeddings, and the extracted sentences can be in any of the other languages supported. Abstractive summaries are created by transformers. The evaluation, done by crowdsourcing evaluators and emergency management experts, and carried out on collections extracted from Twitter during five large-scale disasters spanning ten languages, shows the flexibility of our approach. The generated summaries are regarded as more focused, structured, and coherent than existing state-of-the-art methods, and experts compare them favorably against summaries created by existing, state-of-the-art methods.

AIMar 24, 2022

Human Response to an AI-Based Decision Support System: A User Study on the Effects of Accuracy and Bias

David Solans, Andrea Beretta, Manuel Portela et al.

Artificial Intelligence (AI) is increasingly used to build Decision Support Systems (DSS) across many domains. This paper describes a series of experiments designed to observe human response to different characteristics of a DSS such as accuracy and bias, particularly the extent to which participants rely on the DSS, and the performance they achieve. In our experiments, participants play a simple online game inspired by so-called "wildcat" (i.e., exploratory) drilling for oil. The landscape has two layers: a visible layer describing the costs (terrain), and a hidden layer describing the reward (oil yield). Participants in the control group play the game without receiving any assistance, while in treatment groups they are assisted by a DSS suggesting places to drill. For certain treatments, the DSS does not consider costs, but only rewards, which introduces a bias that is observable by users. Between subjects, we vary the accuracy and bias of the DSS, and observe the participants' total score, time to completion, the extent to which they follow or ignore suggestions. We also measure the acceptability of the DSS in an exit survey. Our results show that participants tend to score better with the DSS, that the score increase is due to users following the DSS advice, and related to the difficulty of the game and the accuracy of the DSS. We observe that this setting elicits mostly rational behavior from participants, who place a moderate amount of trust in the DSS and show neither algorithmic aversion (under-reliance) nor automation bias (over-reliance).However, their stated willingness to accept the DSS in the exit survey seems less sensitive to the accuracy of the DSS than their behavior, suggesting that users are only partially aware of the (lack of) accuracy of the DSS.

GRApr 14

Fast Voxelization and Level of Detail for Microgeometry Rendering

Javier Fabre, Carlos Castillo, Carlos Rodriguez-Pardo et al.

Many materials show anisotropic light scattering patterns due to the shape and local alignment of their underlying micro structures: surfaces with small elements such as fibers, or the ridges of a brushed metal, are very sparse and require a high spatial resolution to be properly represented as a volume. The acquisition of voxel data from such objects is a time and memory-intensive task, and most rendering approaches require an additional Level-of-Detail (LoD) data structure to aggregate the visual appearance, as observed from multiple distances, in order to reduce the number of samples computed per pixel (E.g.: MIP mapping). In this work we introduce first, an efficient parallel voxelization method designed to facilitate fast data aggregation at multiple resolution levels, and second, a novel representation based on hierarchical SGGX clustering that provides better accuracy than baseline methods. We validate our approach with a CUDA-based implementation of the voxelizer, tested both on triangle meshes and volumetric fabrics modeled with explicit fibers. Finally, we show the results generated with a path tracer based on the proposed LoD rendering model.

AINov 20, 2023

Responsible AI Research Needs Impact Statements Too

Alexandra Olteanu, Michael Ekstrand, Carlos Castillo et al.

All types of research, development, and policy work can have unintended, adverse consequences - work in responsible artificial intelligence (RAI), ethical AI, or ethics in AI is no exception.

LGAug 18, 2023

Disparity, Inequality, and Accuracy Tradeoffs in Graph Neural Networks for Node Classification

Arpit Merchant, Carlos Castillo

Graph neural networks (GNNs) are increasingly used in critical human applications for predicting node labels in attributed graphs. Their ability to aggregate features from nodes' neighbors for accurate classification also has the capacity to exacerbate existing biases in data or to introduce new ones towards members from protected demographic groups. Thus, it is imperative to quantify how GNNs may be biased and to what extent their harmful effects may be mitigated. To this end, we propose two new GNN-agnostic interventions namely, (i) PFR-AX which decreases the separability between nodes in protected and non-protected groups, and (ii) PostProcess which updates model predictions based on a blackbox policy to minimize differences between error rates across demographic groups. Through a large set of experiments on four datasets, we frame the efficacies of our approaches (and three variants) in terms of their algorithmic fairness-accuracy tradeoff and benchmark our results against three strong baseline interventions on three state-of-the-art GNN models. Our results show that no single intervention offers a universally optimal tradeoff, but PFR-AX and PostProcess provide granular control and improve model confidence when correctly predicting positive outcomes for nodes in protected groups.

IRMay 27, 2019Code

FairSearch: A Tool For Fairness in Ranked Search Results

Meike Zehlike, Tom Sühr, Carlos Castillo et al.

Ranked search results and recommendations have become the main mechanism by which we find content, products, places, and people online. With hiring, selecting, purchasing, and dating being increasingly mediated by algorithms, rankings may determine career and business opportunities, educational placement, access to benefits, and even social and reproductive success. It is therefore of societal and ethical importance to ask whether search results can demote, marginalize, or exclude individuals of unprivileged groups or promote products with undesired features. In this paper we present FairSearch, the first fair open source search API to provide fairness notions in ranked search results. We implement two algorithms from the fair ranking literature, namely FA*IR (Zehlike et al., 2017) and DELTR (Zehlike and Castillo, 2018) and provide them as stand-alone libraries in Python and Java. Additionally we implement interfaces to Elasticsearch for both algorithms, that use the aforementioned Java libraries and are then provided as Elasticsearch plugins. Elasticsearch is a well-known search engine API based on Apache Lucene. With our plugins we enable search engine developers who wish to ensure fair search results of different styles to easily integrate DELTR and FA*IR into their existing Elasticsearch environment.

CRJan 9

The Echo Chamber Multi-Turn LLM Jailbreak

Ahmad Alobaid, Martí Jordà Roca, Carlos Castillo et al.

The availability of Large Language Models (LLMs) has led to a new generation of powerful chatbots that can be developed at relatively low cost. As companies deploy these tools, security challenges need to be addressed to prevent financial loss and reputational damage. A key security challenge is jailbreaking, the malicious manipulation of prompts and inputs to bypass a chatbot's safety guardrails. Multi-turn attacks are a relatively new form of jailbreaking involving a carefully crafted chain of interactions with a chatbot. We introduce Echo Chamber, a new multi-turn attack using a gradual escalation method. We describe this attack in detail, compare it to other multi-turn attacks, and demonstrate its performance against multiple state-of-the-art models through extensive evaluation.

AIApr 27

An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness

Ioannis Bilionis, Ricardo C. Berrios, Luis Fernandez-Luque et al.

Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluated the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to "flip" for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.

HCFeb 17, 2025

A Comparison of Human and Machine Learning Errors in Face Recognition

Marina Estévez-Almenzar, Ricardo Baeza-Yates, Carlos Castillo

Machine learning applications in high-stakes scenarios should always operate under human oversight. Developing an optimal combination of human and machine intelligence requires an understanding of their complementarities, particularly regarding the similarities and differences in the way they make mistakes. We perform extensive experiments in the area of face recognition and compare two automated face recognition systems against human annotators through a demographically balanced user study. Our research uncovers important ways in which machine learning errors and human errors differ from each other, and suggests potential strategies in which human-machine collaboration can improve accuracy in face recognition.

CYAug 28, 2025

Synthetic CVs To Build and Test Fairness-Aware Hiring Tools

Jorge Saldivar, Anna Gatzioura, Carlos Castillo

Algorithmic hiring has become increasingly necessary in some sectors as it promises to deal with hundreds or even thousands of applicants. At the heart of these systems are algorithms designed to retrieve and rank candidate profiles, which are usually represented by Curricula Vitae (CVs). Research has shown, however, that such technologies can inadvertently introduce bias, leading to discrimination based on factors such as candidates' age, gender, or national origin. Developing methods to measure, mitigate, and explain bias in algorithmic hiring, as well as to evaluate and compare fairness techniques before deployment, requires sets of CVs that reflect the characteristics of people from diverse backgrounds. However, datasets of these characteristics that can be used to conduct this research do not exist. To address this limitation, this paper introduces an approach for building a synthetic dataset of CVs with features modeled on real materials collected through a data donation campaign. Additionally, the resulting dataset of 1,730 CVs is presented, which we envision as a potential benchmarking standard for research on algorithmic hiring discrimination.

LGDec 27, 2024

Disparate Model Performance and Stability in Machine Learning Clinical Support for Diabetes and Heart Diseases

Ioannis Bilionis, Ricardo C. Berrios, Luis Fernandez-Luque et al.

Machine Learning (ML) algorithms are vital for supporting clinical decision-making in biomedical informatics. However, their predictive performance can vary across demographic groups, often due to the underrepresentation of historically marginalized populations in training datasets. The investigation reveals widespread sex- and age-related inequities in chronic disease datasets and their derived ML models. Thus, a novel analytical framework is introduced, combining systematic arbitrariness with traditional metrics like accuracy and data complexity. The analysis of data from over 25,000 individuals with chronic diseases revealed mild sex-related disparities, favoring predictive accuracy for males, and significant age-related differences, with better accuracy for younger patients. Notably, older patients showed inconsistent predictive accuracy across seven datasets, linked to higher data complexity and lower model performance. This highlights that representativeness in training data alone does not guarantee equitable outcomes, and model arbitrariness must be addressed before deploying models in clinical settings.

CYFeb 1, 2022

Rewiring What-to-Watch-Next Recommendations to Reduce Radicalization Pathways

Francesco Fabbri, Yanhao Wang, Francesco Bonchi et al.

Recommender systems typically suggest to users content similar to what they consumed in the past. If a user happens to be exposed to strongly polarized content, she might subsequently receive recommendations which may steer her towards more and more radicalized content, eventually being trapped in what we call a "radicalization pathway". In this paper, we study the problem of mitigating radicalization pathways using a graph-based approach. Specifically, we model the set of recommendations of a "what-to-watch-next" recommender as a d-regular directed graph where nodes correspond to content items, links to recommendations, and paths to possible user sessions. We measure the "segregation" score of a node representing radicalized content as the expected length of a random walk from that node to any node representing non-radicalized content. High segregation scores are associated to larger chances to get users trapped in radicalization pathways. Hence, we define the problem of reducing the prevalence of radicalization pathways by selecting a small number of edges to "rewire", so to minimize the maximum of segregation scores among all radicalized nodes, while maintaining the relevance of the recommendations. We prove that the problem of finding the optimal set of recommendations to rewire is NP-hard and NP-hard to approximate within any factor. Therefore, we turn our attention to heuristics, and propose an efficient yet effective greedy algorithm based on the absorbing random walk theory. Our experiments on real-world datasets in the context of video and news recommendations confirm the effectiveness of our proposal.

HCJan 25, 2022

Diversity in the Music Listening Experience: Insights from Focus Group Interviews

Lorenzo Porcaro, Emilia Gómez, Carlos Castillo

Music listening in today's digital spaces is highly characterized by the availability of huge music catalogues, accessible by people all over the world. In this scenario, recommender systems are designed to guide listeners in finding tracks and artists that best fit their requests, having therefore the power to influence the diversity of the music they listen to. Albeit several works have proposed new techniques for developing diversity-aware recommendations, little is known about how people perceive diversity while interacting with music recommendations. In this study, we interview several listeners about the role that diversity plays in their listening experience, trying to get a better understanding of how they interact with music recommendations. We recruit the listeners among the participants of a previous quantitative study, where they were confronted with the notion of diversity when asked to identify, from a series of electronic music lists, the most diverse ones according to their beliefs. As a follow-up, in this qualitative study we carry out semi-structured interviews to understand how listeners may assess the diversity of a music list and to investigate their experiences with music recommendation diversity. We report here our main findings on 1) what can influence the diversity assessment of tracks and artists' music lists, and 2) which factors can characterize listeners' interaction with music recommendation diversity.

CLOct 25, 2021

SciClops: Detecting and Contextualizing Scientific Claims for Assisting Manual Fact-Checking

Panayiotis Smeros, Carlos Castillo, Karl Aberer

This paper describes SciClops, a method to help combat online scientific misinformation. Although automated fact-checking methods have gained significant attention recently, they require pre-existing ground-truth evidence, which, in the scientific context, is sparse and scattered across a constantly-evolving scientific literature. Existing methods do not exploit this literature, which can effectively contextualize and combat science-related fallacies. Furthermore, these methods rarely require human intervention, which is essential for the convoluted and critical domain of scientific misinformation. SciClops involves three main steps to process scientific claims found in online news articles and social media postings: extraction, clustering, and contextualization. First, the extraction of scientific claims takes place using a domain-specific, fine-tuned transformer model. Second, similar claims extracted from heterogeneous sources are clustered together with related scientific literature using a method that exploits their content and the connections among them. Third, check-worthy claims, broadcasted by popular yet unreliable sources, are highlighted together with an enhanced fact-checking context that includes related verified claims, news articles, and scientific papers. Extensive experiments show that SciClops tackles sufficiently these three steps, and effectively assists non-expert fact-checkers in the verification of complex scientific claims, outperforming commercial fact-checking systems.

LGMar 16, 2021

Predicting Early Dropout: Calibration and Algorithmic Fairness Considerations

Marzieh Karimi-Haghighi, Carlos Castillo, Davinia Hernandez-Leo et al.

In this work, the problem of predicting dropout risk in undergraduate studies is addressed from a perspective of algorithmic fairness. We develop a machine learning method to predict the risks of university dropout and underperformance. The objective is to understand if such a system can identify students at risk while avoiding potential discriminatory biases. When modeling both risks, we obtain prediction models with an Area Under the ROC Curve (AUC) of 0.77-0.78 based on the data available at the enrollment time, before the first year of studies starts. This data includes the students' demographics, the high school they attended, and their admission (average) grade. Our models are calibrated: they produce estimated probabilities for each risk, not mere scores. We analyze if this method leads to discriminatory outcomes for some sensitive groups in terms of prediction accuracy (AUC) and error rates (Generalized False Positive Rate, GFPR, or Generalized False Negative Rate, GFNR). The models exhibit some equity in terms of AUC and GFNR along groups. The similar GFNR means a similar probability of failing to detect risk for students who drop out. The disparities in GFPR are addressed through a mitigation process that does not affect the calibration of the model.

IRJan 28, 2021

Perceptions of Diversity in Electronic Music: the Impact of Listener, Artist, and Track Characteristics

Lorenzo Porcaro, Emilia Gómez, Carlos Castillo

Shared practices to assess the diversity of retrieval system results are still debated in the Information Retrieval community, partly because of the challenges of determining what diversity means in specific scenarios, and of understanding how diversity is perceived by end-users. The field of Music Information Retrieval is not exempt from this issue. Even if fields such as Musicology or Sociology of Music have a long tradition in questioning the representation and the impact of diversity in cultural environments, such knowledge has not been yet embedded into the design and development of music technologies. In this paper, focusing on electronic music, we investigate the characteristics of listeners, artists, and tracks that are influential in the perception of diversity. Specifically, we center our attention on 1) understanding the relationship between perceived diversity and computational methods to measure diversity, and 2) analyzing how listeners' domain knowledge and familiarity influence such perceived diversity. To accomplish this, we design a user-study in which listeners are asked to compare pairs of lists of tracks and artists, and to select the most diverse list from each pair. We compare participants' ratings with results obtained through computational models built using audio tracks' features and artist attributes. We find that such models are generally aligned with participants' choices when most of them agree that one list is more diverse than the other, while they present a mixed behaviour in cases where participants have little agreement. Moreover, we observe how differences in domain knowledge, familiarity, and demographics can influence the level of agreement among listeners, and between listeners and diversity metrics computed automatically.

IRDec 23, 2020

A Note on the Significance Adjustment for FA*IR with Two Protected Groups

Meike Zehlike, Tom Sühr, Carlos Castillo

In this report we provide an improvement of the significance adjustment from the FA*IR algorithm of Zehlike et al., which did not work for very short rankings in combination with a low minimum proportion $p$ for the protected group. We show how the minimum number of protected candidates per ranking position can be calculated exactly and provide a mapping from the continuous space of significance levels ($α$) to a discrete space of tables, which allows us to find $α_c$ using a binary search heuristic.

SIDec 10, 2020

Social Media Alerts can Improve, but not Replace Hydrological Models for Forecasting Floods

Valerio Lorini, Carlos Castillo, Domenico Nappo et al.

Social media can be used for disaster risk reduction as a complement to traditional information sources, and the literature has suggested numerous ways to achieve this. In the case of floods, for instance, data collection from social media can be triggered by a severe weather forecast and/or a flood prediction. By way of contrast, in this paper we explore the possibility of having an entirely independent flood monitoring system which is based completely on social media, and which is completely self-activated. This independence and self-activation would bring increased robustness, as the system would not depend on other mechanisms for forecasting. We observe that social media can indeed help in the early detection of some flood events that would otherwise not be detected until later, albeit at the cost of many false positives. Overall, our experiments suggest that social media signals should only be used to complement existing monitoring systems, and we provide various explanations to support this argument.

IRSep 3, 2020

Exploring Artist Gender Bias in Music Recommendation

Dougal Shakespeare, Lorenzo Porcaro, Emilia Gómez et al.

Music Recommender Systems (mRS) are designed to give personalised and meaningful recommendations of items (i.e. songs, playlists or artists) to a user base, thereby reflecting and further complementing individual users' specific music preferences. Whilst accuracy metrics have been widely applied to evaluate recommendations in mRS literature, evaluating a user's item utility from other impact-oriented perspectives, including their potential for discrimination, is still a novel evaluation practice in the music domain. In this work, we center our attention on a specific phenomenon for which we want to estimate if mRS may exacerbate its impact: gender bias. Our work presents an exploratory study, analyzing the extent to which commonly deployed state of the art Collaborative Filtering(CF) algorithms may act to further increase or decrease artist gender bias. To assess group biases introduced by CF, we deploy a recently proposed metric of bias disparity on two listening event datasets: the LFM-1b dataset, and the earlier constructed Celma's dataset. Our work traces the causes of disparity to variations in input gender distributions and user-item preferences, highlighting the effect such configurations can have on user's gender bias after recommendation generation.

IRAug 27, 2020

SciLens News Platform: A System for Real-Time Evaluation of News Articles

Angelika Romanou, Panayiotis Smeros, Carlos Castillo et al.

We demonstrate the SciLens News Platform, a novel system for evaluating the quality of news articles. The SciLens News Platform automatically collects contextual information about news articles in real-time and provides quality indicators about their validity and trustworthiness. These quality indicators derive from i) social media discussions regarding news articles, showcasing the reach and stance towards these articles, and ii) their content and their referenced sources, showcasing the journalistic foundations of these articles. Furthermore, the platform enables domain-experts to review articles and rate the quality of news sources. This augmented view of news articles, which combines automatically extracted indicators and domain-expert reviews, has provably helped the platform users to have a better consensus about the quality of the underlying articles. The platform is built in a distributed and robust fashion and runs operationally handling daily thousands of news articles. We evaluate the SciLens News Platform on the emerging topic of COVID-19 where we highlight the discrepancies between low and high-quality news outlets based on three axes, namely their newsroom activity, evidence seeking and social engagement. A live demonstration of the platform can be found here: http://scilens.epfl.ch.

HCJul 7, 2020

Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning

Rahul Pandey, Hemant Purohit, Carlos Castillo et al.

High-quality human annotations are necessary for creating effective machine learning-driven stream processing systems. We study hybrid stream processing systems based on a Human-In-The-Loop Machine Learning (HITL-ML) paradigm, in which one or many human annotators and an automatic classifier (trained at least partially by the human annotators) label an incoming stream of instances. This is typical of many near-real-time social media analytics and web applications, including annotating social media posts during emergencies by digital volunteer groups. From a practical perspective, low-quality human annotations result in wrong labels for retraining automated classifiers and indirectly contribute to the creation of inaccurate classifiers. Considering human annotation as a psychological process allows us to address these limitations. We show that human annotation quality is dependent on the ordering of instances shown to annotators and can be improved by local changes in the instance sequence/order provided to the annotators, yielding a more accurate annotation of the stream. We adapt a theoretically-motivated human error framework of mistakes and slips for the human annotation task to study the effect of ordering instances (i.e., an "annotation schedule"). Further, we propose an error-avoidance approach to the active learning paradigm for stream processing applications robust to these likely human errors (in the form of slips) when deciding a human annotation schedule. We support the human error framework using crowdsourcing experiments and evaluate the proposed algorithm against standard baselines for active learning via extensive experimentation on classification tasks of filtering relevant social media posts during natural disasters.

CYJul 2, 2020

Towards Data-Driven Affirmative Action Policies under Uncertainty

Corinna Hertweck, Carlos Castillo, Michael Mathioudakis

In this paper, we study university admissions under a centralized system that uses grades and standardized test scores to match applicants to university programs. We consider affirmative action policies that seek to increase the number of admitted applicants from underrepresented groups. Since such a policy has to be announced before the start of the application period, there is uncertainty about the score distribution of the students applying to each program. This poses a difficult challenge for policy-makers. We explore the possibility of using a predictive model trained on historical data to help optimize the parameters of such policies.

LGApr 15, 2020

Poisoning Attacks on Algorithmic Fairness

David Solans, Battista Biggio, Carlos Castillo

Research in adversarial machine learning has shown how the performance of machine learning models can be seriously compromised by injecting even a small fraction of poisoning points into the training data. While the effects on model accuracy of such poisoning attacks have been widely studied, their potential effects on other model performance metrics remain to be evaluated. In this work, we introduce an optimization framework for poisoning attacks against algorithmic fairness, and develop a gradient-based poisoning attack aimed at introducing classification disparities among different groups in the data. We empirically show that our attack is effective not only in the white-box setting, in which the attacker has full access to the target model, but also in a more challenging black-box scenario in which the attacks are optimized against a substitute model and then transferred to the target model. We believe that our findings pave the way towards the definition of an entirely novel set of adversarial attacks targeting algorithmic fairness in different scenarios, and that investigating such vulnerabilities will help design more robust algorithms and countermeasures in the future.

LGMar 10, 2020

Addressing multiple metrics of group fairness in data-driven decision making

Marius Miron, Songül Tolan, Emilia Gómez et al.

The Fairness, Accountability, and Transparency in Machine Learning (FAT-ML) literature proposes a varied set of group fairness metrics to measure discrimination against socio-demographic groups that are characterized by a protected feature, such as gender or race.Such a system can be deemed as either fair or unfair depending on the choice of the metric. Several metrics have been proposed, some of them incompatible with each other.We do so empirically, by observing that several of these metrics cluster together in two or three main clusters for the same groups and machine learning methods. In addition, we propose a robust way to visualize multidimensional fairness in two dimensions through a Principal Component Analysis (PCA) of the group fairness metrics. Experimental results on multiple datasets show that the PCA decomposition explains the variance between the metrics with one to three components.

OCFeb 16, 2020

Algorithms for Hiring and Outsourcing in the Online Labor Market

Aris Anagnostopoulos, Carlos Castillo, Adriano Fazzone et al.

Although freelancing work has grown substantially in recent years, in part facilitated by a number of online labor marketplaces, (e.g., Guru, Freelancer, Amazon Mechanical Turk), traditional forms of "in-sourcing" work continue being the dominant form of employment. This means that, at least for the time being, freelancing and salaried employment will continue to co-exist. In this paper, we provide algorithms for outsourcing and hiring workers in a general setting, where workers form a team and contribute different skills to perform a task. We call this model team formation with outsourcing. In our model, tasks arrive in an online fashion: neither the number nor the composition of the tasks is known a-priori. At any point in time, there is a team of hired workers who receive a fixed salary independently of the work they perform. This team is dynamic: new members can be hired and existing members can be fired, at some cost. Additionally, some parts of the arriving tasks can be outsourced and thus completed by non-team members, at a premium. Our contribution is an efficient online cost-minimizing algorithm for hiring and firing team members and outsourcing tasks. We present theoretical bounds obtained using a primal-dual scheme proving that our algorithms have a logarithmic competitive approximation ratio. We complement these results with experiments using semi-synthetic datasets based on actual task requirements and worker skills from three large online labor marketplaces.

IRJan 23, 2020

Uneven Coverage of Natural Disasters in Wikipedia: the Case of Flood

Valerio Lorini, Javier Rando, Diego Saez-Trumper et al.

The usage of non-authoritative data for disaster management presents the opportunity of accessing timely information that might not be available through other means, as well as the challenge of dealing with several layers of biases. Wikipedia, a collaboratively-produced encyclopedia, includes in-depth information about many natural and human-made disasters, and its editors are particularly good at adding information in real-time as a crisis unfolds. In this study, we focus on the English version of Wikipedia, that is by far the most comprehensive version of this encyclopedia. Wikipedia tends to have good coverage of disasters, particularly those having a large number of fatalities. However, we also show that a tendency to cover events in wealthy countries and not cover events in poorer ones permeates Wikipedia as a source for disaster-related information. By performing careful automatic content analysis at a large scale, we show how the coverage of floods in Wikipedia is skewed towards rich, English-speaking countries, in particular the US and Canada. We also note how coverage of floods in countries with the lowest income, as well as countries in South America, is substantially lower than the coverage of floods in middle-income countries. These results have implications for systems using Wikipedia or similar collaborative media platforms as an information source for detecting emergencies or for gathering valuable information for disaster response.

SIDec 5, 2019

EviDense: a Graph-based Method for Finding Unique High-impact Events with Succinct Keyword-based Descriptions

Oana Balalau, Carlos Castillo, Mauro Sozio

Despite the significant efforts made by the research community in recent years, automatically acquiring valuable information about high impact-events from social media remains challenging. We present EviDense, a graph-based approach for finding high-impact events (such as disaster events) in social media. One of the challenges we address in our work is to provide for each event a succinct keyword-based description, containing the most relevant information about it, such as what happened, the location, as well as its timeframe. We evaluate our approach on a large collection of tweets posted over a period of 19 months, using a crowdsourcing platform. Our evaluation shows that our method outperforms state-of-the-art approaches for the same problem, in terms of having higher precision, lower number of duplicates, and presenting a keyword-based description that is succinct and informative. We further improve the results of our algorithm by incorporating news from mainstream media. A preliminary version of this work was presented as a 4-pages short paper at ICWSM 2018.

SIAug 18, 2019

Modeling Islamist Extremist Communications on Social Media using Contextual Dimensions: Religion, Ideology, and Hate

Ugur Kursuncu, Manas Gaur, Carlos Castillo et al.

Terror attacks have been linked in part to online extremist content. Although tens of thousands of Islamist extremism supporters consume such content, they are a small fraction relative to peaceful Muslims. The efforts to contain the ever-evolving extremism on social media platforms have remained inadequate and mostly ineffective. Divergent extremist and mainstream contexts challenge machine interpretation, with a particular threat to the precision of classification algorithms. Our context-aware computational approach to the analysis of extremist content on Twitter breaks down this persuasion process into building blocks that acknowledge inherent ambiguity and sparsity that likely challenge both manual and automated classification. We model this process using a combination of three contextual dimensions -- religion, ideology, and hate -- each elucidating a degree of radicalization and highlighting independent features to render them computationally accessible. We utilize domain-specific knowledge resources for each of these contextual dimensions such as Qur'an for religion, the books of extremist ideologues and preachers for political ideology and a social media hate speech corpus for hate. Our study makes three contributions to reliable analysis: (i) Development of a computational approach rooted in the contextual dimensions of religion, ideology, and hate that reflects strategies employed by online Islamist extremist groups, (ii) An in-depth analysis of relevant tweet datasets with respect to these dimensions to exclude likely mislabeled users, and (iii) A framework for understanding online radicalization as a process to assist counter-programming. Given the potentially significant social impact, we evaluate the performance of our algorithms to minimize mislabeling, where our approach outperforms a competitive baseline by 10.2% in precision.

SIJul 16, 2019

Modeling Human Annotation Errors to Design Bias-Aware Systems for Social Stream Processing

Rahul Pandey, Carlos Castillo, Hemant Purohit

High-quality human annotations are necessary to create effective machine learning systems for social media. Low-quality human annotations indirectly contribute to the creation of inaccurate or biased learning systems. We show that human annotation quality is dependent on the ordering of instances shown to annotators (referred as 'annotation schedule'), and can be improved by local changes in the instance ordering provided to the annotators, yielding a more accurate annotation of the data stream for efficient real-time social media analytics. We propose an error-mitigating active learning algorithm that is robust with respect to some cases of human errors when deciding an annotation schedule. We validate the human error model and evaluate the proposed algorithm against strong baselines by experimenting on classification tasks of relevant social media posts during crises. According to these experiments, considering the order in which data instances are presented to human annotators leads to both an increase in accuracy for machine learning and awareness toward some potential biases in human learning that may affect the automated classifier.

IRMar 13, 2019

SciLens: Evaluating the Quality of Scientific News Articles Using Social Media and Scientific Literature Indicators

Panayiotis Smeros, Carlos Castillo, Karl Aberer

This paper describes, develops, and validates SciLens, a method to evaluate the quality of scientific news articles. The starting point for our work are structured methodologies that define a series of quality aspects for manually evaluating news. Based on these aspects, we describe a series of indicators of news quality. According to our experiments, these indicators help non-experts evaluate more accurately the quality of a scientific news article, compared to non-experts that do not have access to these indicators. Furthermore, SciLens can also be used to produce a completely automated quality score for an article, which agrees more with expert evaluators than manual evaluations done by non-experts. One of the main elements of SciLens is the focus on both content and context of articles, where context is provided by (1) explicit and implicit references on the article to scientific literature, and (2) reactions in social media referencing the article. We show that both contextual elements can be valuable sources of information for determining article quality. The validation of SciLens, done through a combination of expert and non-expert annotation, demonstrates its effectiveness for both semi-automatic and automatic quality evaluation of scientific news.

IRJul 17, 2018

To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content

Sofiane Abbar, Carlos Castillo, Antonio Sanfilippo

Predicting the popularity of online content has attracted much attention in the past few years. In news rooms, for instance, journalists and editors are keen to know, as soon as possible, the articles that will bring the most traffic into their website. The relevant literature includes a number of approaches and algorithms to perform this forecasting. Most of the proposed methods require monitoring the popularity of content during some time after it is posted, before making any longer-term prediction. In this paper, we propose a new approach for predicting the popularity of news articles before they go online. Our approach complements existing content-based methods, and is based on a number of observations regarding article similarity and topicality. First, the popularity of a new article is correlated with the popularity of similar articles of recent publication. Second, the popularity of the new article is related to the recent historical popularity of its main topic. Based on these observations, we use time series forecasting to predict the number of visits an article will receive. Our experiments, conducted on a real data collection of articles in an international news website, demonstrate the effectiveness and efficiency of the proposed method.

AIJun 7, 2018

Assessing the impact of machine intelligence on human behaviour: an interdisciplinary endeavour

Emilia Gómez, Carlos Castillo, Vicky Charisi et al.

This document contains the outcome of the first Human behaviour and machine intelligence (HUMAINT) workshop that took place 5-6 March 2018 in Barcelona, Spain. The workshop was organized in the context of a new research programme at the Centre for Advanced Studies, Joint Research Centre of the European Commission, which focuses on studying the potential impact of artificial intelligence on human behaviour. The workshop gathered an interdisciplinary group of experts to establish the state of the art research in the field and a list of future research challenges to be addressed on the topic of human and machine intelligence, algorithm's potential impact on human cognitive capabilities and decision making, and evaluation and regulation needs. The document is made of short position statements and identification of challenges provided by each expert, and incorporates the result of the discussions carried out during the workshop. In the conclusion section, we provide a list of emerging research topics and strategies to be addressed in the near future.

IRMay 22, 2018

Reducing Disparate Exposure in Ranking: A Learning To Rank Approach

Meike Zehlike, Carlos Castillo

Ranked search results have become the main mechanism by which we find content, products, places, and people online. Thus their ordering contributes not only to the satisfaction of the searcher, but also to career and business opportunities, educational placement, and even social success of those being ranked. Researchers have become increasingly concerned with systematic biases in data-driven ranking models, and various post-processing methods have been proposed to mitigate discrimination and inequality of opportunity. This approach, however, has the disadvantage that it still allows an unfair ranking model to be trained. In this paper we explore a new in-processing approach: DELTR, a learning-to-rank framework that addresses potential issues of discrimination and unequal opportunity in rankings at training time. We measure these problems in terms of discrepancies in the average group exposure and design a ranker that optimizes search results in terms of relevance and in terms of reducing such discrepancies. We perform an extensive experimental study showing that being "colorblind" can be among the best or the worst choices from the perspective of relevance and exposure, depending on how much and which kind of bias is present in the training set. We show that our in-processing method performs better in terms of relevance and exposure than a pre-processing and a post-processing method across all tested scenarios.

CYJun 20, 2017

FA*IR: A Fair Top-k Ranking Algorithm

Meike Zehlike, Francesco Bonchi, Carlos Castillo et al.

In this work, we define and solve the Fair Top-k Ranking problem, in which we want to determine a subset of k candidates from a large pool of n >> k candidates, maximizing utility (i.e., select the "best" candidates) subject to group fairness criteria. Our ranked group fairness definition extends group fairness using the standard notion of protected groups and is based on ensuring that the proportion of protected candidates in every prefix of the top-k ranking remains statistically above or indistinguishable from a given minimum. Utility is operationalized in two ways: (i) every candidate included in the top-$k$ should be more qualified than every candidate not included; and (ii) for every pair of candidates in the top-k, the more qualified candidate should be ranked above. An efficient algorithm is presented for producing the Fair Top-k Ranking, and tested experimentally on existing datasets as well as new datasets released with this paper, showing that our approach yields small distortions with respect to rankings that maximize utility without considering fairness criteria. To the best of our knowledge, this is the first algorithm grounded in statistical tests that can mitigate biases in the representation of an under-represented group along a ranked list.

CVMay 21, 2017

The Do's and Don'ts for CNN-based Face Verification

Ankan Bansal, Carlos Castillo, Rajeev Ranjan et al.

While the research community appears to have developed a consensus on the methods of acquiring annotated data, design and training of CNNs, many questions still remain to be answered. In this paper, we explore the following questions that are critical to face recognition research: (i) Can we train on still images and expect the systems to work on videos? (ii) Are deeper datasets better than wider datasets? (iii) Does adding label noise lead to improvement in performance of deep networks? (iv) Is alignment needed for face recognition? We address these questions by training CNNs using CASIA-WebFace, UMDFaces, and a new video dataset and testing on YouTube- Faces, IJB-A and a disjoint portion of UMDFaces datasets. Our new data set, which will be made publicly available, has 22,075 videos and 3,735,476 human annotated frames extracted from them.

CVFeb 15, 2017

Deep Heterogeneous Feature Fusion for Template-Based Face Recognition

Navaneeth Bodla, Jingxiao Zheng, Hongyu Xu et al.

Although deep learning has yielded impressive performance for face recognition, many studies have shown that different networks learn different feature maps: while some networks are more receptive to pose and illumination others appear to capture more local information. Thus, in this work, we propose a deep heterogeneous feature fusion network to exploit the complementary information present in features generated by different deep convolutional neural networks (DCNNs) for template-based face recognition, where a template refers to a set of still face images or video frames from different sources which introduces more blur, pose, illumination and other variations than traditional face datasets. The proposed approach efficiently fuses the discriminative information of different deep features by 1) jointly learning the non-linear high-dimensional projection of the deep features and 2) generating a more discriminative template representation which preserves the inherent geometry of the deep features in the feature space. Experimental results on the IARPA Janus Challenge Set 3 (Janus CS3) dataset demonstrate that the proposed method can effectively improve the recognition performance. In addition, we also present a series of covariate experiments on the face verification task for in-depth qualitative evaluations for the proposed approach.

CVJan 9, 2017

Son of Zorn's Lemma: Targeted Style Transfer Using Instance-aware Semantic Segmentation

Carlos Castillo, Soham De, Xintong Han et al.

Style transfer is an important task in which the style of a source image is mapped onto that of a target image. The method is useful for synthesizing derivative works of a particular artist or specific painting. This work considers targeted style transfer, in which the style of a template image is used to alter only part of a target image. For example, an artist may wish to alter the style of only one particular object in a target image without altering the object's general morphology or surroundings. This is useful, for example, in augmented reality applications (such as the recently released Pokemon GO), where one wants to alter the appearance of a single real-world object in an image frame to make it appear as a cartoon. Most notably, the rendering of real-world objects into cartoon characters has been used in a number of films and television show, such as the upcoming series Son of Zorn. We present a method for targeted style transfer that simultaneously segments and stylizes single objects selected by the user. The method uses a Markov random field model to smooth and anti-alias outlier pixels near object boundaries, so that stylized objects naturally blend into their surroundings.

CVNov 6, 2016

Deep Convolutional Neural Network Features and the Original Image

Connor J. Parde, Carlos Castillo, Matthew Q. Hill et al.

Face recognition algorithms based on deep convolutional neural networks (DCNNs) have made progress on the task of recognizing faces in unconstrained viewing conditions. These networks operate with compact feature-based face representations derived from learning a very large number of face images. While the learned features produced by DCNNs can be highly robust to changes in viewpoint, illumination, and appearance, little is known about the nature of the face code that emerges at the top level of such networks. We analyzed the DCNN features produced by two face recognition algorithms. In the first set of experiments we used the top-level features from the DCNNs as input into linear classifiers aimed at predicting metadata about the images. The results show that the DCNN features contain surprisingly accurate information about the yaw and pitch of a face, and about whether the face came from a still image or a video frame. In the second set of experiments, we measured the extent to which individual DCNN features operated in a view-dependent or view-invariant manner. We found that view-dependent coding was a characteristic of the identities rather than the DCNN features - with some identities coded consistently in a view-dependent way and others in a view-independent way. In our third analysis, we visualized the DCNN feature space for over 24,000 images of 500 identities. Images in the center of the space were uniformly of low quality (e.g., extreme views, face occlusion, low resolution). Image quality increased monotonically as a function of distance from the origin. This result suggests that image quality information is available in the DCNN features, such that consistently average feature values reflect coding failures that reliably indicate poor or unusable images. Combined, the results offer insight into the coding mechanisms that support robust representation of faces in DCNNs.

CVNov 4, 2016

UMDFaces: An Annotated Face Dataset for Training Deep Networks

Ankan Bansal, Anirudh Nanduri, Carlos Castillo et al.

Recent progress in face detection (including keypoint detection), and recognition is mainly being driven by (i) deeper convolutional neural network architectures, and (ii) larger datasets. However, most of the large datasets are maintained by private companies and are not publicly available. The academic computer vision community needs larger and more varied datasets to make further progress. In this paper we introduce a new face dataset, called UMDFaces, which has 367,888 annotated faces of 8,277 subjects. We also introduce a new face recognition evaluation protocol which will help advance the state-of-the-art in this area. We discuss how a large dataset can be collected and annotated using human annotators and deep networks. We provide human curated bounding boxes for faces. We also provide estimated pose (roll, pitch and yaw), locations of twenty-one key-points and gender information generated by a pre-trained neural network. In addition, the quality of keypoint annotations has been verified by humans for about 115,000 images. Finally, we compare the quality of the dataset with other publicly available face datasets at similar scales.

CLOct 6, 2016

A Robust Framework for Classifying Evolving Document Streams in an Expert-Machine-Crowd Setting

Muhammad Imran, Sanjay Chawla, Carlos Castillo

An emerging challenge in the online classification of social media data streams is to keep the categories used for classification up-to-date. In this paper, we propose an innovative framework based on an Expert-Machine-Crowd (EMC) triad to help categorize items by continuously identifying novel concepts in heterogeneous data streams often riddled with outliers. We unify constrained clustering and outlier detection by formulating a novel optimization problem: COD-Means. We design an algorithm to solve the COD-Means problem and show that COD-Means will not only help detect novel categories but also seamlessly discover human annotation errors and improve the overall quality of the categorization process. Experiments on diverse real data sets demonstrate that our approach is both effective and efficient.

CVMay 31, 2016

Biconvex Relaxation for Semidefinite Programming in Computer Vision

Sohil Shah, Abhay Kumar, Carlos Castillo et al.

Semidefinite programming is an indispensable tool in computer vision, but general-purpose solvers for semidefinite programs are often too slow and memory intensive for large-scale problems. We propose a general framework to approximately solve large-scale semidefinite problems (SDPs) at low complexity. Our approach, referred to as biconvex relaxation (BCR), transforms a general SDP into a specific biconvex optimization problem, which can then be solved in the original, low-dimensional variable space at low complexity. The resulting biconvex problem is solved using an efficient alternating minimization (AM) procedure. Since AM has the potential to get stuck in local minima, we propose a general initialization scheme that enables BCR to start close to a global optimum - this is key for our algorithm to quickly converge to optimal or near-optimal solutions. We showcase the efficacy of our approach on three applications in computer vision, namely segmentation, co-segmentation, and manifold metric learning. BCR achieves solution quality comparable to state-of-the-art SDP methods with speedups between 4X and 35X. At the same time, BCR handles a more general set of SDPs than previous approaches, which are more specialized.

CLMay 19, 2016

Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages

Muhammad Imran, Prasenjit Mitra, Carlos Castillo

Microblogging platforms such as Twitter provide active communication channels during mass convergence and emergency events such as earthquakes, typhoons. During the sudden onset of a crisis situation, affected people post useful information on Twitter that can be used for situational awareness and other humanitarian disaster response efforts, if processed timely and effectively. Processing social media information pose multiple challenges such as parsing noisy, brief and informal messages, learning information categories from the incoming stream of messages and classifying them into different classes among others. One of the basic necessities of many of these tasks is the availability of data, in particular human-annotated data. In this paper, we present human-annotated Twitter corpora collected during 19 different crises that took place between 2013 and 2015. To demonstrate the utility of the annotations, we train machine learning classifiers. Moreover, we publish first largest word2vec word embeddings trained on 52 million crisis-related tweets. To deal with tweets language issues, we present human-annotated normalized lexical resources for different lexical variations.

CVApr 19, 2016

Triplet Probabilistic Embedding for Face Verification and Clustering

Swami Sankaranarayanan, Azadeh Alavi, Carlos Castillo et al.

Despite significant progress made over the past twenty five years, unconstrained face verification remains a challenging problem. This paper proposes an approach that couples a deep CNN-based approach with a low-dimensional discriminative embedding learned using triplet probability constraints to solve the unconstrained face verification problem. Aside from yielding performance improvements, this embedding provides significant advantages in terms of memory and for post-processing operations like subject specific clustering. Experiments on the challenging IJB-A dataset show that the proposed algorithm performs comparably or better than the state of the art methods in verification and identification metrics, while requiring much less training data and training time. The superior performance of the proposed method on the CFP dataset shows that the representation learned by our deep CNN is robust to extreme pose variation. Furthermore, we demonstrate the robustness of the deep features to challenges including age, pose, blur and clutter by performing simple clustering experiments on both IJB-A and LFW datasets.

CYFeb 26, 2016

Enabling Digital Health by Automatic Classification of Short Messages

Muhammad Imran, Patrick Meier, Carlos Castillo et al.

In response to the growing HIV/AIDS and other health-related issues, UNICEF through their U-Report platform receives thousands of messages (SMS) every day to provide prevention strategies, health case advice, and counsel- ing support to vulnerable population. Due to a rapid increase in U-Report usage (up to 300% in last 3 years), plus approximately 1,000 new registrations each day, the volume of messages has thus continued to increase, which made it impossible for the team at UNICEF to process them in a timely manner. In this paper, we present a platform designed to perform automatic classification of short messages (SMS) in real-time to help UNICEF categorize and prioritize health-related messages as they arrive. We employ a hybrid approach, which combines human and machine intelligence that seeks to resolve the information overload issue by introducing processing of large-scale data at high-speed while maintaining a high classification accuracy. The system has recently been tested in conjunction with UNICEF in Zambia to classify short messages received via the U-Report platform on various health related issues. The system is designed to enable UNICEF make sense of a large volume of short messages in a timely manner. In terms of evaluation, we report design choices, challenges, and performance of the system observed during the deployment to validate its effectiveness.

CYSep 29, 2014

Controversy and Sentiment in Online News

Yelena Mejova, Amy X. Zhang, Nicholas Diakopoulos et al.

How do news sources tackle controversial issues? In this work, we take a data-driven approach to understand how controversy interplays with emotional expression and biased language in the news. We begin by introducing a new dataset of controversial and non-controversial terms collected using crowdsourcing. Then, focusing on 15 major U.S. news outlets, we compare millions of articles discussing controversial and non-controversial issues over a span of 7 months. We find that in general, when it comes to controversial issues, the use of negative affect and biased language is prevalent, while the use of strong emotion is tempered. We also observe many differences across news sources. Using these findings, we show that we can indicate to what extent an issue is controversial, by comparing it with other issues in terms of how they are portrayed across different media.

CRMay 21, 2014

TweetCred: Real-Time Credibility Assessment of Content on Twitter

Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo et al.

During sudden onset crisis events, the presence of spam, rumors and fake content on Twitter reduces the value of information contained on its messages (or "tweets"). A possible solution to this problem is to use machine learning to automatically evaluate the credibility of a tweet, i.e. whether a person would deem the tweet believable or trustworthy. This has been often framed and studied as a supervised classification problem in an off-line (post-hoc) setting. In this paper, we present a semi-supervised ranking model for scoring tweets according to their credibility. This model is used in TweetCred, a real-time system that assigns a credibility score to tweets in a user's timeline. TweetCred, available as a browser plug-in, was installed and used by 1,127 Twitter users within a span of three months. During this period, the credibility score for about 5.4 million tweets was computed, allowing us to evaluate TweetCred in terms of response time, effectiveness and usability. To the best of our knowledge, this is the first research work to develop a real-time system for credibility on Twitter, and to evaluate it on a user base of this size.

DBOct 21, 2013

Engineering Crowdsourced Stream Processing Systems

Muhammad Imran, Ioanna Lykourentzou, Yannick Naudet et al.

A crowdsourced stream processing system (CSP) is a system that incorporates crowdsourced tasks in the processing of a data stream. This can be seen as enabling crowdsourcing work to be applied on a sample of large-scale data at high speed, or equivalently, enabling stream processing to employ human intelligence. It also leads to a substantial expansion of the capabilities of data processing systems. Engineering a CSP system requires the combination of human and machine computation elements. From a general systems theory perspective, this means taking into account inherited as well as emerging properties from both these elements. In this paper, we position CSP systems within a broader taxonomy, outline a series of design principles and evaluation metrics, present an extensible framework for their design, and describe several design patterns. We showcase the capabilities of CSP systems by performing a case study that applies our proposed framework to the design and analysis of a real system (AIDR) that classifies social media messages during time-critical crisis events. Results show that compared to a pure stream processing system, AIDR can achieve a higher data classification accuracy, while compared to a pure crowdsourcing solution, the system makes better use of human workers by requiring much less manual work effort.

CLJul 18, 2013

Says who? Automatic Text-Based Content Analysis of Television News

Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza et al.

We perform an automatic analysis of television news programs, based on the closed captions that accompany them. Specifically, we collect all the news broadcasted in over 140 television channels in the US during a period of six months. We start by segmenting, processing, and annotating the closed captions automatically. Next, we focus on the analysis of their linguistic style and on mentions of people using NLP methods. We present a series of key insights about news providers, people in the news, and we discuss the biases that can be uncovered by automatic means. These insights are contrasted by looking at the data from multiple points of view, including qualitative assessment.