CLJan 13, 2023
Multilingual Detection of Check-Worthy Claims using World Languages and Adapter FusionIpek Baris Schlicht, Lucie Flek, Paolo Rosso
Check-worthiness detection is the task of identifying claims, worthy to be investigated by fact-checkers. Resource scarcity for non-world languages and model learning costs remain major challenges for the creation of models supporting multilingual check-worthiness detection. This paper proposes cross-training adapters on a subset of world languages, combined by adapter fusion, to detect claims emerging globally in multiple languages. (1) With a vast number of annotators available for world languages and the storage-efficient adapter models, this approach is more cost efficient. Models can be updated more frequently and thus stay up-to-date. (2) Adapter fusion provides insights and allows for interpretation regarding the influence of each adapter model on a particular language. The proposed solution often outperformed the top multilingual approaches in our benchmark tasks.
CLNov 14, 2023Code
Spot: A Natural Language Interface for Geospatial Searches in OSMLynn Khellaf, Ipek Baris Schlicht, Julia Bayer et al.
Investigative journalists and fact-checkers have found OpenStreetMap (OSM) to be an invaluable resource for their work due to its extensive coverage and intricate details of various locations, which play a crucial role in investigating news scenes. Despite its value, OSM's complexity presents considerable accessibility and usability challenges, especially for those without a technical background. To address this, we introduce 'Spot', a user-friendly natural language interface for querying OSM data. Spot utilizes a semantic mapping from natural language to OSM tags, leveraging artificially generated sentence queries and a T5 transformer. This approach enables Spot to extract relevant information from user-input sentences and display candidate locations matching the descriptions on a map. To foster collaboration and future advancement, all code and generated data is available as an open-source repository.
CLJul 7, 2023
DWReCO at CheckThat! 2023: Enhancing Subjectivity Detection through Style-based Data SamplingIpek Baris Schlicht, Lynn Khellaf, Defne Altiok
This paper describes our submission for the subjectivity detection task at the CheckThat! Lab. To tackle class imbalances in the task, we have generated additional training materials with GPT-3 models using prompts of different styles from a subjectivity checklist based on journalistic perspective. We used the extended training set to fine-tune language-specific transformer models. Our experiments in English, German and Turkish demonstrate that different subjective styles are effective across all languages. In addition, we observe that the style-based oversampling is better than paraphrasing in Turkish and English. Lastly, the GPT-3 models sometimes produce lacklustre results when generating style-based texts in non-English languages.
IRJun 16, 2025Code
SPOT: Bridging Natural Language and Geospatial Search for Investigative JournalistsLynn Khellaf, Ipek Baris Schlicht, Tilman Mirass et al.
OpenStreetMap (OSM) is a vital resource for investigative journalists doing geolocation verification. However, existing tools to query OSM data such as Overpass Turbo require familiarity with complex query languages, creating barriers for non-technical users. We present SPOT, an open source natural language interface that makes OSM's rich, tag-based geographic data more accessible through intuitive scene descriptions. SPOT interprets user inputs as structured representations of geospatial object configurations using fine-tuned Large Language Models (LLMs), with results being displayed in an interactive map interface. While more general geospatial search tasks are conceivable, SPOT is specifically designed for use in investigative journalism, addressing real-world challenges such as hallucinations in model output, inconsistencies in OSM tagging, and the noisy nature of user input. It combines a novel synthetic data pipeline with a semantic bundling system to enable robust, accurate query generation. To our knowledge, SPOT is the first system to achieve reliable natural language access to OSM data at this level of accuracy. By lowering the technical barrier to geolocation verification, SPOT contributes a practical tool to the broader efforts to support fact-checking and combat disinformation.
LGMay 7, 2025Code
MedSyn: Enhancing Diagnostics with Human-AI CollaborationBurcu Sayin, Ipek Baris Schlicht, Ngoc Vo Hong et al.
Clinical decision-making is inherently complex, often influenced by cognitive biases, incomplete information, and case ambiguity. Large Language Models (LLMs) have shown promise as tools for supporting clinical decision-making, yet their typical one-shot or limited-interaction usage may overlook the complexities of real-world medical practice. In this work, we propose a hybrid human-AI framework, MedSyn, where physicians and LLMs engage in multi-step, interactive dialogues to refine diagnoses and treatment decisions. Unlike static decision-support tools, MedSyn enables dynamic exchanges, allowing physicians to challenge LLM suggestions while the LLM highlights alternative perspectives. Through simulated physician-LLM interactions, we assess the potential of open-source LLMs as physician assistants. Results show open-source LLMs are promising as physician assistants in the real world. Future work will involve real physician interactions to further validate MedSyn's usefulness in diagnostic accuracy and patient outcomes.
43.3AIMay 8
Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency CareBurcu Sayin, Ngoc Vo Hong, Ipek Baris Schlicht et al.
Clinical decision-making in emergency medicine demands rapid, accurate diagnoses under uncertainty. Despite benchmark progress, evidence for LLMs as interactive aids in live physician workflows remains sparse. MedSyn lets physicians iteratively query an LLM provided with the full clinical record while initially viewing only the chief complaint. Seven physicians (three seniors, four residents) completed baseline and AI-assisted sessions across 52 MIMIC-IV cases stratified by difficulty. Blinded evaluation showed residents' Hard-case correctness rose from 0.589 to 0.734; difficulty-standardised completely-correct rates confirmed a medium effect (Δ = 0.092; p = 0.071; d = 0.47). Automated metrics corroborated these gains: standardised any-match accuracy improved by 0.156 (p < 0.0001), and residents showed the largest F1 gain (Δ = 0.138; p < 0.0001). Dialogue analysis revealed expertise-dependent strategies (seniors asked targeted, hypothesis-driven questions; residents relied on broader queries) and cross-expertise concordance increased (Δ = 0.145; p < 0.0001). Interactive LLM support meaningfully enhances diagnostic reasoning.
CLJan 24, 2025
Do LLMs Provide Consistent Answers to Health-Related Questions across Languages?Ipek Baris Schlicht, Zhixue Zhao, Burcu Sayin et al.
Equitable access to reliable health information is vital for public health, but the quality of online health resources varies by language, raising concerns about inconsistencies in Large Language Models (LLMs) for healthcare. In this study, we examine the consistency of responses provided by LLMs to health-related questions across English, German, Turkish, and Chinese. We largely expand the HealthFC dataset by categorizing health-related questions by disease type and broadening its multilingual scope with Turkish and Chinese translations. We reveal significant inconsistencies in responses that could spread healthcare misinformation. Our main contributions are 1) a multilingual health-related inquiry dataset with meta-information on disease categories, and 2) a novel prompt-based evaluation workflow that enables sub-dimensional comparisons between two languages through parsing. Our findings highlight key challenges in deploying LLM-based tools in multilingual contexts and emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
CLApr 9, 2024
Pitfalls of Conversational LLMs on News DebiasingIpek Baris Schlicht, Defne Altiok, Maryanne Taouk et al.
This paper addresses debiasing in news editing and evaluates the effectiveness of conversational Large Language Models in this task. We designed an evaluation checklist tailored to news editors' perspectives, obtained generated texts from three popular conversational models using a subset of a publicly available dataset in media bias, and evaluated the texts according to the designed checklist. Furthermore, we examined the models as evaluator for checking the quality of debiased model outputs. Our findings indicate that none of the LLMs are perfect in debiasing. Notably, some models, including ChatGPT, introduced unnecessary changes that may impact the author's style and create misinformation. Lastly, we show that the models do not perform as proficiently as domain experts in evaluating the quality of debiased outputs.
CLOct 28, 2024
A Survey on Automatic Credibility Assessment Using Textual Credibility Signals in the Era of Large Language ModelsIvan Srba, Olesya Razuvayevskaya, João A. Leite et al.
In the age of social media and generative AI, the ability to automatically assess the credibility of online content has become increasingly critical, complementing traditional approaches to false information detection. Credibility assessment relies on aggregating diverse credibility signals - small units of information, such as content subjectivity, bias, or a presence of persuasion techniques - into a final credibility label/score. However, current research in automatic credibility assessment and credibility signals detection remains highly fragmented, with many signals studied in isolation and lacking integration. Notably, there is a scarcity of approaches that detect and aggregate multiple credibility signals simultaneously. These challenges are further exacerbated by the absence of a comprehensive and up-to-date overview of research works that connects these research efforts under a common framework and identifies shared trends, challenges, and open problems. In this survey, we address this gap by presenting a systematic and comprehensive literature review of 175 research papers, focusing on textual credibility signals within the field of Natural Language Processing (NLP), which undergoes a rapid transformation due to advancements in Large Language Models (LLMs). While positioning the NLP research into the the broader multidisciplinary landscape, we examine both automatic credibility assessment methods as well as the detection of nine categories of credibility signals. We provide an in-depth analysis of three key categories: 1) factuality, subjectivity and bias, 2) persuasion techniques and logical fallacies, and 3) check-worthy and fact-checked claims. In addition to summarising existing methods, datasets, and tools, we outline future research direction and emerging opportunities, with particular attention to evolving challenges posed by generative AI.
CLOct 20, 2025
Disparities in Multilingual LLM-Based Healthcare Q&AIpek Baris Schlicht, Burcu Sayin, Zhixue Zhao et al.
Equitable access to reliable health information is vital when integrating AI into healthcare. Yet, information quality varies across languages, raising concerns about the reliability and consistency of multilingual Large Language Models (LLMs). We systematically examine cross-lingual disparities in pre-training source and factuality alignment in LLM answers for multilingual healthcare Q&A across English, German, Turkish, Chinese (Mandarin), and Italian. We (i) constructed Multilingual Wiki Health Care (MultiWikiHealthCare), a multilingual dataset from Wikipedia; (ii) analyzed cross-lingual healthcare coverage; (iii) assessed LLM response alignment with these references; and (iv) conducted a case study on factual alignment through the use of contextual information and Retrieval-Augmented Generation (RAG). Our findings reveal substantial cross-lingual disparities in both Wikipedia coverage and LLM factual alignment. Across LLMs, responses align more with English Wikipedia, even when the prompts are non-English. Providing contextual excerpts from non-English Wikipedia at inference time effectively shifts factual alignment toward culturally relevant knowledge. These results highlight practical pathways for building more equitable, multilingual AI systems for healthcare.
IRDec 11, 2021
UPV at TREC Health Misinformation Track 2021 Ranking with SBERT and Quality EstimatorsIpek Baris Schlicht, Angel Felipe Magnossão de Paula, Paolo Rosso
Health misinformation on search engines is a significant problem that could negatively affect individuals or public health. To mitigate the problem, TREC organizes a health misinformation track. This paper presents our submissions to this track. We use a BM25 and a domain-specific semantic search engine for retrieving initial documents. Later, we examine a health news schema for quality assessment and apply it to re-rank documents. We merge the scores from the different components by using reciprocal rank fusion. Finally, we discuss the results and conclude with future works.
CLNov 8, 2021
Sexism Prediction in Spanish and English Tweets Using Monolingual and Multilingual BERT and Ensemble ModelsAngel Felipe Magnossão de Paula, Roberto Fray da Silva, Ipek Baris Schlicht
The popularity of social media has created problems such as hate speech and sexism. The identification and classification of sexism in social media are very relevant tasks, as they would allow building a healthier social environment. Nevertheless, these tasks are considerably challenging. This work proposes a system to use multilingual and monolingual BERT and data points translation and ensemble strategies for sexism identification and classification in English and Spanish. It was conducted in the context of the sEXism Identification in Social neTworks shared 2021 (EXIST 2021) task, proposed by the Iberian Languages Evaluation Forum (IberLEF). The proposed system and its main components are described, and an in-depth hyperparameters analysis is conducted. The main results observed were: (i) the system obtained better results than the baseline model (multilingual BERT); (ii) ensemble models obtained better results than monolingual models; and (iii) an ensemble model considering all individual models and the best standardized values obtained the best accuracies and F1-scores for both tasks. This work obtained first place in both tasks at EXIST, with the highest accuracies (0.780 for task 1 and 0.658 for task 2) and F1-scores (F1-binary of 0.780 for task 1 and F1-macro of 0.579 for task 2).
CLNov 8, 2021
AI-UPV at IberLEF-2021 DETOXIS task: Toxicity Detection in Immigration-Related Web News Comments Using Transformers and Statistical ModelsAngel Felipe Magnossão de Paula, Ipek Baris Schlicht
This paper describes our participation in the DEtection of TOXicity in comments In Spanish (DETOXIS) shared task 2021 at the 3rd Workshop on Iberian Languages Evaluation Forum. The shared task is divided into two related classification tasks: (i) Task 1: toxicity detection and; (ii) Task 2: toxicity level detection. They focus on the xenophobic problem exacerbated by the spread of toxic comments posted in different online news articles related to immigration. One of the necessary efforts towards mitigating this problem is to detect toxicity in the comments. Our main objective was to implement an accurate model to detect xenophobia in comments about web news articles within the DETOXIS shared task 2021, based on the competition's official metrics: the F1-score for Task 1 and the Closeness Evaluation Metric (CEM) for Task 2. To solve the tasks, we worked with two types of machine learning models: (i) statistical models and (ii) Deep Bidirectional Transformers for Language Understanding (BERT) models. We obtained our best results in both tasks using BETO, an BERT model trained on a big Spanish corpus. We obtained the 3rd place in Task 1 official ranking with the F1-score of 0.5996, and we achieved the 6th place in Task 2 official ranking with the CEM of 0.7142. Our results suggest: (i) BERT models obtain better results than statistical models for toxicity detection in text comments; (ii) Monolingual BERT models have an advantage over multilingual BERT models in toxicity detection in text comments in their pre-trained language.
CLSep 19, 2021
Unified and Multilingual Author Profiling for Detecting HatersIpek Baris Schlicht, Angel Felipe Magnossão de Paula
This paper presents a unified user profiling framework to identify hate speech spreaders by processing their tweets regardless of the language. The framework encodes the tweets with sentence transformers and applies an attention mechanism to select important tweets for learning user profiles. Furthermore, the attention layer helps to explain why a user is a hate speech spreader by producing attention weights at both token and post level. Our proposed model outperformed the state-of-the-art multilingual transformer models.
CLSep 19, 2021
UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy ClaimsIpek Baris Schlicht, Angel Felipe Magnossão de Paula, Paolo Rosso
Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to solve the multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within the communities on determining what is check-worthy.In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias.With this purpose, we experiment joint training by using the datasets from CLEF-2021 CheckThat!, that contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.
CLAug 8, 2021
Leveraging Commonsense Knowledge on Classifying False News and Determining Checkworthiness of ClaimsIpek Baris Schlicht, Erhan Sezerer, Selma Tekir et al.
Widespread and rapid dissemination of false news has made fact-checking an indispensable requirement. Given its time-consuming and labor-intensive nature, the task calls for an automated support to meet the demand. In this paper, we propose to leverage commonsense knowledge for the tasks of false news classification and check-worthy claim detection. Arguing that commonsense knowledge is a factor in human believability, we fine-tune the BERT language model with a commonsense question answering task and the aforementioned tasks in a multi-task learning environment. For predicting fine-grained false news types, we compare the proposed fine-tuned model's performance with the false news classification models on a public dataset as well as a newly collected dataset. We compare the model's performance with the single-task BERT model and a state-of-the-art check-worthy claim detection tool to evaluate the check-worthy claim detection. Our experimental analysis demonstrates that commonsense knowledge can improve performance in both tasks.