CL May 29
Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection
Benedetta Muscato, Beiduo Chen, Gizem Gezici et al.
Human disagreement is ubiquitous and well-known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how to best evaluate human labels and rationales -- or even how to best aggregate rationales beyond majority vote -- in light of this variation. Yet, rationales may provide additional insights into the richness of human reasoning, which may differ in style, values and interpretations -- especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties -- predictive and distributional -- while explainability metrics are organized along three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label (hard and soft) and rationale representation space (hard, intermediate and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.
IR Oct 7, 2022
Quantifying Political Bias in News Articles
Gizem Gezici
Search bias analysis has been receiving more attention in recent years since search results could affect [...] In this work, we aim to establish an automated model for evaluating ideological bias in online news articles. The dataset is composed of news articles in search results as well as newspaper articles. The current results show that the automated model is not yet capable enough to annotate documents automatically and thereby compute bias in search results.
CY Oct 31, 2022
Keywords for Bias
Abdurrezak Efe, Gizem Gezici, Aysenur Uzun et al.
This work proposes analysing a set of keywords for bias analysis. To this end, we use several NLP approaches and compare them based on their capability to detect keywords for analysing bias. The overall findings show that our proposed approach gives results comparable to the state-of-the-art approaches on different benchmark datasets.
IR Feb 28, 2025
Hybrid Retrieval for Hallucination Mitigation in Large Language Models: A Comparative Analysis
Chandana Sree Mala, Gizem Gezici, Fosca Giannotti
Large Language Models (LLMs) excel in language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. Retrieval Augmented Generation (RAG) systems address this issue by grounding LLM responses with external knowledge. This study evaluates the relationship between retriever effectiveness and hallucination reduction in LLMs using three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic search with Sentence Transformers, and a proposed hybrid retrieval module. The hybrid module incorporates query expansion and combines the results of sparse and dense retrievers through a dynamically weighted Reciprocal Rank Fusion score. Using the HaluBench dataset, a benchmark for hallucinations in question answering tasks, we assess retrieval performance with metrics such as mean average precision and normalised discounted cumulative gain, focusing on the relevance of the top three retrieved documents. Results show that the hybrid retriever achieves better relevance scores, outperforming both sparse and dense retrievers. Further evaluation of LLM-generated answers against ground truth using metrics such as accuracy, hallucination rate, and rejection rate reveals that the hybrid retriever achieves the highest accuracy on fails, the lowest hallucination rate, and the lowest rejection rate. These findings highlight the hybrid retriever's ability to enhance retrieval relevance, reduce hallucination rates, and improve LLM reliability, emphasising the importance of advanced retrieval techniques in mitigating hallucinations and improving response accuracy.
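The fusion step described above can be illustrated with a minimal sketch of weighted Reciprocal Rank Fusion. This is not the authors' implementation: the function name `weighted_rrf`, the smoothing constant `k=60` (a commonly used default), and the fixed equal weights are all assumptions, and the abstract does not say how the weights are adjusted dynamically.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    rankings: dict mapping a retriever name to its doc ids, best first.
    weights:  dict mapping a retriever name to its weight (assumed fixed here).
    k:        smoothing constant; 60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for name, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            # Each retriever contributes weight / (k + rank) for a document.
            scores[doc_id] += weights[name] / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a sparse (BM25) and a dense retriever.
sparse = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = weighted_rrf(
    {"sparse": sparse, "dense": dense},
    {"sparse": 0.5, "dense": 0.5},
)
print(fused)
```

Documents ranked well by both retrievers accumulate score from each list, which is why a document that is not first in either list can still rank highly after fusion.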
CL Mar 1, 2025
Embracing Diversity: A Multi-Perspective Approach with Soft Labels
Benedetta Muscato, Praveen Bushipaka, Gizem Gezici et al.
Prior studies show that embracing the annotation diversity shaped by different backgrounds and life experiences and incorporating it into model learning, i.e. the multi-perspective approach, contributes to the development of more responsible models. Thus, in this paper we propose a new framework for designing and further evaluating perspective-aware models on the stance detection task, in which multiple annotators assign stances on a controversial topic. We also share a new dataset built from both human and LLM annotations. Results show that the multi-perspective approach yields better classification performance (higher F1-scores), outperforming the traditional approaches that use a single ground truth, while displaying lower model confidence scores, probably due to the high level of subjectivity of the stance detection task.
CL Jun 25, 2025
Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
Benedetta Muscato, Lucia Passaro, Gizem Gezici et al.
In the realm of Natural Language Processing (NLP), common approaches for handling human disagreement consist of aggregating annotators' viewpoints to establish a single ground truth. However, prior studies show that disregarding individual opinions can lead to the side effect of underrepresenting minority perspectives, especially in subjective tasks, where annotators may systematically disagree because of their preferences. Recognizing that labels reflect the diverse backgrounds, life experiences, and values of individuals, this study proposes a new multi-perspective approach using soft labels to encourage the development of the next generation of perspective-aware models that are more inclusive and pluralistic. We conduct an extensive analysis across diverse subjective text classification tasks, including hate speech, irony, abusive language, and stance detection, to highlight the importance of capturing human disagreements, often overlooked by traditional aggregation methods. Results show that the multi-perspective approach not only better approximates human label distributions, as measured by Jensen-Shannon Divergence (JSD), but also achieves superior classification performance (higher F1 scores), outperforming traditional approaches. However, our approach exhibits lower confidence in tasks like irony and stance detection, likely due to the inherent subjectivity present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model uncertainty and uncover meaningful insights into model predictions.
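To make the soft-label and JSD terminology concrete, here is a small sketch that turns annotator votes into a label distribution and compares it with a model's predicted distribution. The helper names, the example votes, and the choice of base-2 logarithms (which bounds JSD in [0, 1]) are assumptions for illustration, not details taken from the paper.

```python
import math


def soft_label(votes, classes):
    """Turn a list of annotator votes into a probability distribution."""
    counts = [votes.count(c) for c in classes]
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example: three annotators, two classes.
classes = ["hate", "not_hate"]
human = soft_label(["hate", "hate", "not_hate"], classes)  # [2/3, 1/3]
model = [0.6, 0.4]  # assumed model output after softmax
print(jsd(human, model))
```

A hard (majority-vote) label would collapse the human distribution to `[1.0, 0.0]` and discard the dissenting vote, whereas the soft label keeps it. Note that some libraries return the square root of this quantity (the Jensen-Shannon distance), so values are not always directly comparable.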
CL Nov 13, 2024
Multi-Perspective Stance Detection
Benedetta Muscato, Praveen Bushipaka, Gizem Gezici et al.
Subjective NLP tasks usually rely on human annotations provided by multiple annotators, whose judgments may vary due to their diverse backgrounds and life experiences. Traditional methods often aggregate multiple annotations into a single ground truth, disregarding the diversity in perspectives that arises from annotator disagreement. In this preliminary study, we examine the effect of including multiple annotations on model accuracy in classification. Our methodology investigates the performance of perspective-aware classification models in the stance detection task and further inspects whether annotator disagreement affects model confidence. The results show that the multi-perspective approach yields better classification performance, outperforming the baseline that uses a single label. This suggests that designing more inclusive perspective-aware AI models is not only an essential first step in implementing responsible and ethical AI, but can also achieve better results than the traditional approaches.
CL Oct 16, 2024
Learning by Surprise: Surplexity for Mitigating Model Collapse in Generative AI
Daniele Gambetta, Gizem Gezici, Fosca Giannotti et al.
As synthetic content increasingly infiltrates the web, generative AI models may be retrained on their own outputs: a process termed "autophagy". This leads to model collapse: a progressive loss of performance and diversity across generations. Recent studies have examined the emergence of model collapse across various generative AI models and data types, and have proposed mitigation strategies that rely on incorporating human-authored content. However, current characterizations of model collapse remain limited, and existing mitigation methods assume reliable knowledge of whether training data is human-authored or AI-generated. In this paper, we address these gaps by introducing new measures that characterise collapse directly from a model's next-token probability distributions, rather than from properties of AI-generated text. Using these measures, we show that the degree of collapse depends on the complexity of the initial training set, as well as on the extent of autophagy. Our experiments prompt a new suggestion: that model collapse occurs when a model trains on data that does not "surprise" it. We express this hypothesis in terms of the well-known Free Energy Principle in cognitive science. Building on this insight, we propose a practical mitigation strategy: filtering training items by high surplexity, maximising the surprise of the model. Unlike existing methods, this approach does not require distinguishing between human- and AI-generated data. Experiments across datasets and models demonstrate that our strategy is at least as effective as human-data baselines, and even more effective in reducing distributional skewedness. Our results provide a richer understanding of model collapse and point toward more resilient approaches for training generative AI systems in environments increasingly saturated with synthetic data.
IR Jun 29, 2024
A survey on the impacts of recommender systems on users, items, and human-AI ecosystems
Luca Pappalardo, Salvatore Citraro, Giuliano Cornacchia et al.
Recommendation systems and assistants (in short, recommenders) influence through online platforms most actions of our daily lives, suggesting items or providing solutions based on users' preferences or requests. This survey systematically reviews, categorises, and discusses the impact of recommenders in four human-AI ecosystems -- social media, online retail, urban mapping and generative AI ecosystems. Its scope is to systematise a fast-growing field in which terminologies employed to classify methodologies and outcomes are fragmented and unsystematic. This is a crucial contribution to the literature because terminologies vary substantially across disciplines and ecosystems, hindering comparison and accumulation of knowledge in the field. We follow the customary steps of qualitative systematic review, gathering 154 articles from different disciplines to develop a parsimonious taxonomy of methodologies employed (empirical, simulation, observational, controlled), outcomes observed (concentration, content degradation, discrimination, diversity, echo chamber, filter bubble, homogenisation, polarisation, radicalisation, volume), and their level of analysis (individual, item, and ecosystem). We systematically discuss substantive and methodological commonalities across ecosystems, and highlight potential avenues for future research. The survey is addressed to scholars and practitioners interested in different human-AI ecosystems, policymakers and institutional stakeholders who want to better understand the measurable outcomes of recommenders, and tech companies who wish to obtain a systematic view of the impact of their recommenders.
IR Dec 29, 2021
Literature Review of the Pioneering Approaches in Cloud-based Search Engines Powered by LETOR Techniques
Gizem Gezici
Search engines play an essential role in our daily lives. They are also crucial in the enterprise domain for accessing documents from various information sources. Since traditional search systems index documents mainly by the frequency of the words occurring in them, they support keyword search but can barely support natural language search. Keyword-based search is unlikely to be sufficient for enterprise data, which is growing extremely fast. Thus, enterprise search is becoming increasingly critical in the corporate domain. In this report, we present an overview of the state-of-the-art technologies in the literature for three main purposes: i) to increase the retrieval performance of a search engine, ii) to deploy a search platform to a cloud environment, and iii) to select the best terms for expanding queries, both to achieve an even higher retrieval performance and to provide good query suggestions to users for a better user experience.
IR Dec 28, 2021
Query Suggestion for Click-Absent Queries in Enterprise Search
Gizem Gezici
Creating alternative queries, also known as query suggestion, has proven helpful in improving users' search experience. With these suggestions, users can satisfy their information needs more quickly and accurately. In many scenarios, the suggestions can be generated from click-through logs by building a bipartite graph of the clicked query-document pairs. Most existing methods focus on click-existing queries, which have click information in the search logs, and suggest related queries using the co-clicked documents. In this paper, we propose a simple yet effective query suggestion method specifically for click-absent queries that ensures semantic consistency without using any additional resources. Our experimental results show that the proposed technique generates comparatively good suggestions for click-absent queries on a real bilingual enterprise search log.
IR Dec 23, 2021
Biased or Not?: The Story of Two Search Engines
Gizem Gezici
Search engines can be considered a gateway to the Web, and they also decide what we see for a given search query. Since many people are exposed to information through search engines, it is fair to expect search engines to be neutral; i.e. the returned results must cover all the elements or aspects of the search topic, and they should be impartial, with results returned based on relevance. However, search engine results are based on many features and sophisticated algorithms in which search neutrality is not necessarily the focal point. In this work, we performed an empirical study on two popular search engines and analysed the search engine result pages for controversial topics such as abortion, medical marijuana, and gay marriage. Our analysis uses the sentiment in search results to identify their viewpoint as conservative or liberal. We also propose three sentiment-based metrics to show the existence of bias as well as to compare the viewpoints of the two search engines. Extensive experiments performed on controversial topics show that both search engines are biased; moreover, they have the same kind of bias towards a given controversial topic.
IR Dec 23, 2021
Customising Ranking Models for Enterprise Search on Bilingual Click-Through Dataset
Gizem Gezici
In this work, we describe the process of establishing an end-to-end system for enterprise search on a bilingual click-through dataset. The first part of the paper covers the high-level workflow of the system. The second part discusses in detail the ranking models used to improve the search results in the vertical search of technical documents in the enterprise domain. Throughout the paper, we describe how we combine methods from the IR literature. Finally, in the last part of the paper, we report our results for different ranking algorithms using $NDCG@k$, where $k$ is the cut-off value.
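For readers unfamiliar with the metric, the following is a minimal sketch of $NDCG@k$ for a single query. It assumes graded relevance labels, a linear gain, and a base-2 logarithmic discount; the paper may use a different gain (for example the exponential form $2^{rel} - 1$), and the function names are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k normalised by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the returned documents, in ranked order (hypothetical).
ranked_relevances = [0, 1]
print(ndcg_at_k(ranked_relevances, k=2))
```

The cut-off `k` limits the evaluation to the top of the ranking, so a relevant document placed below position `k` contributes nothing to the score.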