CLOct 28, 2022
"It's Not Just Hate'': A Multi-Dimensional Perspective on Detecting Harmful Speech OnlineFederico Bianchi, Stefanie Anja Hills, Patricia Rossini et al. · stanford
Well-annotated data is a prerequisite for good Natural Language Processing models. Too often, though, annotation decisions are governed by optimizing time or annotator agreement. We make a case for nuanced efforts in an interdisciplinary setting for annotating offensive online speech. Detecting offensive content is rapidly becoming one of the most important real-world NLP tasks. However, most datasets use a single binary label, e.g., for hate or incivility, even though each concept is multi-faceted. This modeling choice severely limits nuanced insights, but also performance. We show that a more fine-grained multi-label approach to predicting incivility and hateful or intolerant content addresses both conceptual and performance issues. We release a novel dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels for different aspects of incivility and intolerance. Our dataset not only allows for a more nuanced understanding of harmful speech online, models trained on it also outperform or match performance on benchmark datasets.
CLDec 18, 2022
Beyond Digital "Echo Chambers": The Role of Viewpoint Diversity in Political DiscussionRishav Hada, Amir Ebrahimi Fard, Sarah Shugars et al.
Increasingly taking place in online spaces, modern political conversations are typically perceived to be unproductively affirming -- siloed in so called ``echo chambers'' of exclusively like-minded discussants. Yet, to date we lack sufficient means to measure viewpoint diversity in conversations. To this end, in this paper, we operationalize two viewpoint metrics proposed for recommender systems and adapt them to the context of social media conversations. This is the first study to apply these two metrics (Representation and Fragmentation) to real world data and to consider the implications for online conversations specifically. We apply these measures to two topics -- daylight savings time (DST), which serves as a control, and the more politically polarized topic of immigration. We find that the diversity scores for both Fragmentation and Representation are lower for immigration than for DST. Further, we find that while pro-immigrant views receive consistent pushback on the platform, anti-immigrant views largely operate within echo chambers. We observe less severe yet similar patterns for DST. Taken together, Representation and Fragmentation paint a meaningful and important new picture of viewpoint diversity.
AISep 18, 2023
Evaluation of Human-Understandability of Global Model Explanations using Decision TreeAdarsa Sivaprasad, Ehud Reiter, Nava Tintarev et al.
In explainable artificial intelligence (XAI) research, the predominant focus has been on interpreting models for experts and practitioners. Model agnostic and local explanation approaches are deemed interpretable and sufficient in many applications. However, in domains like healthcare, where end users are patients without AI or domain expertise, there is an urgent need for model explanations that are more comprehensible and instil trust in the model's operations. We hypothesise that generating model explanations that are narrative, patient-specific and global(holistic of the model) would enable better understandability and enable decision-making. We test this using a decision tree model to generate both local and global explanations for patients identified as having a high risk of coronary heart disease. These explanations are presented to non-expert users. We find a strong individual preference for a specific type of explanation. The majority of participants prefer global explanations, while a smaller group prefers local explanations. A task based evaluation of mental models of these participants provide valuable feedback to enhance narrative global explanations. This, in turn, guides the design of health informatics systems that are both trustworthy and actionable.
HCSep 11, 2023
A Co-design Study for Multi-Stakeholder Job Recommender System ExplanationsRoan Schellingerhout, Francesco Barile, Nava Tintarev
Recent legislation proposals have significantly increased the demand for eXplainable Artificial Intelligence (XAI) in many businesses, especially in so-called `high-risk' domains, such as recruitment. Within recruitment, AI has become commonplace, mainly in the form of job recommender systems (JRSs), which try to match candidates to vacancies, and vice versa. However, common XAI techniques often fall short in this domain due to the different levels and types of expertise of the individuals involved, making explanations difficult to generalize. To determine the explanation preferences of the different stakeholder types - candidates, recruiters, and companies - we created and validated a semi-structured interview guide. Using grounded theory, we structurally analyzed the results of these interviews and found that different stakeholder types indeed have strongly differing explanation preferences. Candidates indicated a preference for brief, textual explanations that allow them to quickly judge potential matches. On the other hand, hiring managers preferred visual graph-based explanations that provide a more technical and comprehensive overview at a glance. Recruiters found more exhaustive textual explanations preferable, as those provided them with more talking points to convince both parties of the match. Based on these findings, we describe guidelines on how to design an explanation interface that fulfills the requirements of all three stakeholder types. Furthermore, we provide the validated interview guide, which can assist future research in determining the explanation preferences of different stakeholder types.
HCSep 24, 2024
Creating Healthy Friction: Determining Stakeholder Requirements of Job Recommendation ExplanationsRoan Schellingerhout, Francesco Barile, Nava Tintarev
The increased use of information retrieval in recruitment, primarily through job recommender systems (JRSs), can have a large impact on job seekers, recruiters, and companies. As a result, such systems have been determined to be high-risk in recent legislature. This requires JRSs to be trustworthy and transparent, allowing stakeholders to understand why specific recommendations were made. To fulfill this requirement, the stakeholders' exact preferences and needs need to be determined. To do so, we evaluated an explainable job recommender system using a realistic, task-based, mixed-design user study (n=30) in which stakeholders had to make decisions based on the model's explanations. This mixed-methods evaluation consisted of two objective metrics - correctness and efficiency, along with three subjective metrics - trust, transparency, and usefulness. These metrics were evaluated twice per participant, once using real explanations and once using random explanations. The study included a qualitative analysis following a think-aloud protocol while performing tasks adapted to each stakeholder group. We find that providing stakeholders with real explanations does not significantly improve decision-making speed and accuracy. Our results showed a non-significant trend for the real explanations to outperform the random ones on perceived trust, usefulness, and transparency of the system for all stakeholder types. We determine that stakeholders benefit more from interacting with explanations as decision support capable of providing healthy friction, rather than as previously-assumed persuasive tools.
IRJan 9, 2025
De-centering the (Traditional) User: Multistakeholder Evaluation of Recommender SystemsRobin Burke, Gediminas Adomavicius, Toine Bogers et al.
Multistakeholder recommender systems are those that account for the impacts and preferences of multiple groups of individuals, not just the end users receiving recommendations. Due to their complexity, these systems cannot be evaluated strictly by the overall utility of a single stakeholder, as is often the case of more mainstream recommender system applications. In this article, we focus our discussion on the challenges of multistakeholder evaluation of recommender systems. We bring attention to the different aspects involved -- from the range of stakeholders involved (including but not limited to providers and consumers) to the values and specific goals of each relevant stakeholder. We discuss how to move from theoretical principles to practical implementation, providing specific use case examples. Finally, we outline open research directions for the RecSys community to explore. We aim to provide guidance to researchers and practitioners about incorporating these complex and domain-dependent issues of evaluation in the course of designing, developing, and researching applications with multistakeholder aspects.
CLMay 8, 2025
The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group RecommendationsCedric Waterschoot, Nava Tintarev, Francesco Barile
Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including In-Context learning or generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase the performance at higher degrees of group complexity, while adding other prompt modifications, specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.
92.3CYApr 9
Keeping an Eye on AI: A Framework for Effective Human Oversight of AI SystemsSusanne Gaube, Markus Langer, Tim Miller et al.
The use of Artificial Intelligence (AI) in high-risk, decision-making scenarios presents technical, safety, and normative challenges; problems that may only be ameliorated by human oversight. However, notions of human oversight lack a common foundational understanding: oversight architectures are not well defined, the roles involved remain unclear, and implementation steps are opaque. Hence, researchers and practitioners struggle to determine how to design, implement, and evaluate systems that enable effective human oversight. This paper advances a practical framework for effective human oversight of AI systems, based on a cross-disciplinary perspective that draws on insights from computer science, human-computer interaction, psychology, philosophy, and law. The core contributions are: (1) a foundational framework, with a working definition, architecture and processes for effective human oversight of AI systems; (2) an initial template for documenting oversight architectures and processes, applied to diverse domains; and (3) a synthesis of open research challenges that need to be considered in the emerging field of effective human oversight of AI systems.
IRMar 17, 2025
OKRA: an Explainable, Heterogeneous, Multi-Stakeholder Job Recommender SystemRoan Schellingerhout, Francesco Barile, Nava Tintarev
The use of recommender systems in the recruitment domain has been labeled as 'high-risk' in recent legislation. As a result, strict requirements regarding explainability and fairness have been put in place to ensure proper treatment of all involved stakeholders. To allow for stakeholder-specific explainability, while also handling highly heterogeneous recruitment data, we propose a novel explainable multi-stakeholder job recommender system using graph neural networks: the Occupational Knowledge-based Recommender using Attention (OKRA). The proposed method is capable of providing both candidate- and company-side recommendations and explanations. We find that OKRA performs substantially better than six baselines in terms of nDCG for two datasets. Furthermore, we find that the tested models show a bias toward candidates and vacancies located in urban areas. Overall, our findings suggest that OKRA provides a balance between accuracy, explainability, and fairness.
CVNov 21, 2025
A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance FeedbackBulat Khaertdinov, Mirela Popa, Nava Tintarev
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
CLJul 18, 2025
Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group RecommendationsCedric Waterschoot, Nava Tintarev, Francesco Barile
Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
HCJan 29, 2021
Disparate Impact Diminishes Consumer Trust Even for Advantaged UsersTim Draws, Zoltán Szlávik, Benjamin Timmermans et al.
Systems aiming to aid consumers in their decision-making (e.g., by implementing persuasive techniques) are more likely to be effective when consumers trust them. However, recent research has demonstrated that the machine learning algorithms that often underlie such technology can act unfairly towards specific groups (e.g., by making more favorable predictions for men than for women). An undesired disparate impact resulting from this kind of algorithmic unfairness could diminish consumer trust and thereby undermine the purpose of the system. We studied this effect by conducting a between-subjects user study investigating how (gender-related) disparate impact affected consumer trust in an app designed to improve consumers' financial decision-making. Our results show that disparate impact decreased consumers' trust in the system and made them less likely to use it. Moreover, we find that trust was affected to the same degree across consumer groups (i.e., advantaged and disadvantaged users) despite both of these consumer groups recognizing their respective levels of personal benefit. Our findings highlight the importance of fairness in consumer-oriented artificial intelligence systems.
IRJan 15, 2021
Operationalizing Framing to Support Multiperspective Recommendations of Opinion PiecesMats Mulder, Oana Inel, Jasper Oosterman et al.
Diversity in personalized news recommender systems is often defined as dissimilarity, and based on topic diversity (e.g., corona versus farmers strike). Diversity in news media, however, is understood as multiperspectivity (e.g., different opinions on corona measures), and arguably a key responsibility of the press in a democratic society. While viewpoint diversity is often considered synonymous with source diversity in communication science domain, in this paper, we take a computational view. We operationalize the notion of framing, adopted from communication science. We apply this notion to a re-ranking of topic-relevant recommended lists, to form the basis of a novel viewpoint diversification method. Our offline evaluation indicates that the proposed method is capable of enhancing the viewpoint diversity of recommendation lists according to a diversity metric from literature. In an online study, on the Blendle platform, a Dutch news aggregator platform, with more than 2000 users, we found that users are willing to consume viewpoint diverse news recommendations. We also found that presentation characteristics significantly influence the reading behaviour of diverse recommendations. These results suggest that future research on presentation aspects of recommendations can be just as important as novel viewpoint diversification methods to truly achieve multiperspectivity in online news environments.
IROct 27, 2020
Assessing Viewpoint Diversity in Search Results Using Ranking Fairness MetricsTim Draws, Nava Tintarev, Ujwal Gadiraju et al.
The way pages are ranked in search results influences whether the users of search engines are exposed to more homogeneous, or rather to more diverse viewpoints. However, this viewpoint diversity is not trivial to assess. In this paper we use existing and novel ranking fairness metrics to evaluate viewpoint diversity in search result rankings. We conduct a controlled simulation study that shows how ranking fairness metrics can be used for viewpoint diversity, how their outcome should be interpreted, and which metric is most suitable depending on the situation. This paper lays out important ground work for future research to measure and assess viewpoint diversity in real search result rankings.
CLOct 23, 2020
Helping users discover perspectives: Enhancing opinion mining with joint topic modelsTim Draws, Jody Liu, Nava Tintarev
Support or opposition concerning a debated claim such as abortion should be legal can have different underlying reasons, which we call perspectives. This paper explores how opinion mining can be enhanced with joint topic modeling, to identify distinct perspectives within the topic, providing an informative overview from unstructured text. We evaluate four joint topic models (TAM, JST, VODUM, and LAM) in a user study assessing human understandability of the extracted perspectives. Based on the results, we conclude that joint topic models such as TAM can discover perspectives that align with human judgments. Moreover, our results suggest that users are not influenced by their pre-existing stance on the topic of abortion when interpreting the output of topic models.
IRSep 6, 2020
Contextual Personalized Re-Ranking of Music Recommendations through Audio FeaturesBoning Gong, Mesut Kaya, Nava Tintarev
Users are able to access millions of songs through music streaming services like Spotify, Pandora, and Deezer. Access to such large catalogs, created a need for relevant song recommendations. However, user preferences are highly subjective in nature and change according to context (e.g., music that is suitable in the morning is not as suitable in the evening). Moreover, the music one user may prefer in a given context may be different from what another user prefers in the same context (i.e., what is considered good morning music differs across users). Accurately representing these preferences is essential to creating accurate and effective song recommendations. User preferences for songs can be based on high level audio features, such as tempo and valence. In this paper, we therefore propose a contextual re-ranking algorithm, based on audio feature representations of user preferences in specific contextual conditions. We evaluate the performance of our re-ranking algorithm using the #NowPlaying-RS dataset, which exists of user listening events crawled from Twitter and is enriched with song audio features. We compare a global (context for all users) and personalized (context for each user) model based on these audio features. The global model creates an audio feature representation of each contextual condition based on the preferences of all users. Unlike the global model, the personalized model creates user-specific audio feature representations of contextual conditions, and is measured across 333 distinct users. We show that the personalized model outperforms the global model when evaluated using the precision and mean average precision metrics.
HCMay 1, 2020
Eliciting User Preferences for Personalized Explanations for Video SummariesOana Inel, Nava Tintarev, Lora Aroyo
Video summaries or highlights are a compelling alternative for exploring and contextualizing unprecedented amounts of video material. However, the summarization process is commonly automatic, non-transparent and potentially biased towards particular aspects depicted in the original video. Therefore, our aim is to help users like archivists or collection managers to quickly understand which summaries are the most representative for an original video. In this paper, we present empirical results on the utility of different types of visual explanations to achieve transparency for end users on how representative video summaries are, with respect to the original video. We consider four types of video summary explanations, which use in different ways the concepts extracted from the original video subtitles and the video stream, and their prominence. The explanations are generated to meet target user preferences and express different dimensions of transparency: concept prominence, semantic coverage, distance and quantity of coverage. In two user studies we evaluate the utility of the visual explanations for achieving transparency for end users. Our results show that explanations representing all of the dimensions have the highest utility for transparency, and consequently, for understanding the representativeness of video summaries.
IRAug 9, 2017
Ephemeral Context to Support Robust and Diverse Music RecommendationsPavel Kucherbaev, Nava Tintarev, Carlos Rodriguez
While prior work on context-based music recommendation focused on fixed set of contexts (e.g. walking, driving, jogging), we propose to use multiple sensors and external data sources to describe momentary (ephemeral) context in a rich way with a very large number of possible states (e.g. jogging fast along in downtown of Sydney under a heavy rain at night being tired and angry). With our approach, we address the problems which current approaches face: 1) a limited ability to infer context from missing or faulty sensor data; 2) an inability to use contextual information to support novel content discovery.