Anna Wróblewska

CL
h-index42
25papers
407citations
Novelty25%
AI Score40

25 Papers

CLJul 2, 2024
Fake News Detection: It's All in the Data!

Soveatin Kuntur, Anna Wróblewska, Marcin Paprzycki et al.

This comprehensive survey serves as an indispensable resource for researchers embarking on the journey of fake news detection. By highlighting the pivotal role of dataset quality and diversity, it underscores the significance of these elements in the effectiveness and robustness of detection models. The survey meticulously outlines the key features of datasets, various labeling systems employed, and prevalent biases that can impact model performance. Additionally, it addresses critical ethical issues and best practices, offering a thorough overview of the current state of available datasets. Our contribution to this field is further enriched by the provision of GitHub repository, which consolidates publicly accessible datasets into a single, user-friendly portal. This repository is designed to facilitate and stimulate further research and development efforts aimed at combating the pervasive issue of fake news.

LGJun 9, 2022
Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis

Maciej Pawłowski, Anna Wróblewska, Sylwia Sysko-Romańczuk

Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have been proven successful, they have not been compared yet. Therefore it has been ambiguous which technique can be expected to yield the best results in a given scenario and what factors should be considered while choosing such a technique. This paper explores the most common techniques for building multimodal data representations -- the late fusion, the early fusion, and the sketch, and compares them in classification tasks. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M datasets. In general, our results confirm that multimodal representations are able to boost the performance of unimodal models from 0.919 to 0.969 of accuracy on Amazon Reviews and 0.907 to 0.918 of AUC on MovieLens25M. However, experiments on both MovieLens datasets indicate the importance of the meaningful input data to the given task. In this article, we show that the choice of the technique for building multimodal representation is crucial to obtain the highest possible model's performance, that comes with the proper modalities combination. Such choice relies on: the influence that each modality has on the analyzed machine learning (ML) problem; the type of the ML task; the memory constraints while training and predicting phase.

IRAug 10, 2022
Identifying Substitute and Complementary Products for Assortment Optimization with Cleora Embeddings

Sergiy Tkachuk, Anna Wróblewska, Jacek Dąbrowski et al.

Recent years brought an increasing interest in the application of machine learning algorithms in e-commerce, omnichannel marketing, and the sales industry. It is not only to the algorithmic advances but also to data availability, representing transactions, users, and background product information. Finding products related in different ways, i.e., substitutes and complements is essential for users' recommendations at the vendor's site and for the vendor - to perform efficient assortment optimization. The paper introduces a novel method for finding products' substitutes and complements based on the graph embedding Cleora algorithm. We also provide its experimental evaluation with regards to the state-of-the-art Shopper algorithm, studying the relevance of recommendations with surveys from industry experts. It is concluded that the new approach presented here offers suitable choices of recommended products, requiring a minimal amount of additional information. The algorithm can be used in various enterprises, effectively identifying substitute and complementary product options.

CLMay 31, 2022
Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

Michał Możdżonek, Anna Wróblewska, Sergiy Tkachuk et al.

Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features both in English and Polish languages. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases, the results were even better. Additionally, we prepared a new dataset entirely in Polish and based on offers in selected categories obtained from several online stores for the research purpose. It is the first open dataset for product matching tasks in Polish, which allows comparing the effectiveness of the pre-trained models. Thus, we also showed the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.

CVSep 21, 2023
Survey of Action Recognition, Spotting and Spatio-Temporal Localization in Soccer -- Current Trends and Research Perspectives

Karolina Seweryn, Anna Wróblewska, Szymon Łukasik

Action scene understanding in soccer is a challenging task due to the complex and dynamic nature of the game, as well as the interactions between players. This article provides a comprehensive overview of this task divided into action recognition, spotting, and spatio-temporal action localization, with a particular emphasis on the modalities used and multimodal methods. We explore the publicly available data sources and metrics used to evaluate models' performance. The article reviews recent state-of-the-art methods that leverage deep learning techniques and traditional methods. We focus on multimodal methods, which integrate information from multiple sources, such as video and audio data, and also those that represent one source in various ways. The advantages and limitations of methods are discussed, along with their potential for improving the accuracy and robustness of models. Finally, the article highlights some of the open research questions and future directions in the field of soccer action recognition, including the potential for multimodal methods to advance this field. Overall, this survey provides a valuable resource for researchers interested in the field of action scene understanding in soccer.

CLNov 28, 2022
Revisiting Distance Metric Learning for Few-Shot Natural Language Classification

Witold Sosnowski, Anna Wróblewska, Karolina Seweryn et al.

Distance Metric Learning (DML) has attracted much attention in image processing in recent years. This paper analyzes its impact on supervised fine-tuning language models for Natural Language Processing (NLP) classification tasks under few-shot learning settings. We investigated several DML loss functions in training RoBERTa language models on known SentEval Transfer Tasks datasets. We also analyzed the possibility of using proxy-based DML losses during model inference. Our systematic experiments have shown that under few-shot learning settings, particularly proxy-based DML losses can positively affect the fine-tuning and inference of a supervised language model. Models tuned with a combination of CCE (categorical cross-entropy loss) and ProxyAnchor Loss have, on average, the best performance and outperform models with only CCE by about 3.27 percentage points -- up to 10.38 percentage points depending on the training dataset.

CLNov 28, 2022
Distance Metric Learning Loss Functions in Few-Shot Scenarios of Supervised Language Models Fine-Tuning

Witold Sosnowski, Karolina Seweryn, Anna Wróblewska et al.

This paper presents an analysis regarding an influence of the Distance Metric Learning (DML) loss functions on the supervised fine-tuning of the language models for classification tasks. We experimented with known datasets from SentEval Transfer Tasks. Our experiments show that applying the DML loss function can increase performance on downstream classification tasks of RoBERTa-large models in few-shot scenarios. Models fine-tuned with the use of SoftTriple loss can achieve better results than models with a standard categorical cross-entropy loss function by about 2.89 percentage points from 0.04 to 13.48 percentage points depending on the training dataset. Additionally, we accomplished a comprehensive analysis with explainability techniques to assess the models' reliability and explain their results.

LGAug 7, 2024
Consumer Transactions Simulation through Generative Adversarial Networks

Sergiy Tkachuk, Szymon Łukasik, Anna Wróblewska

In the rapidly evolving domain of large-scale retail data systems, envisioning and simulating future consumer transactions has become a crucial area of interest. It offers significant potential to fortify demand forecasting and fine-tune inventory management. This paper presents an innovative application of Generative Adversarial Networks (GANs) to generate synthetic retail transaction data, specifically focusing on a novel system architecture that combines consumer behavior modeling with stock-keeping unit (SKU) availability constraints to address real-world assortment optimization challenges. We diverge from conventional methodologies by integrating SKU data into our GAN architecture and using more sophisticated embedding methods (e.g., hyper-graphs). This design choice enables our system to generate not only simulated consumer purchase behaviors but also reflects the dynamic interplay between consumer behavior and SKU availability -- an aspect often overlooked, among others, because of data scarcity in legacy retail simulation models. Our GAN model generates transactions under stock constraints, pioneering a resourceful experimental system with practical implications for real-world retail operation and strategy. Preliminary results demonstrate enhanced realism in simulated transactions measured by comparing generated items with real ones using methods employed earlier in related studies. This underscores the potential for more accurate predictive modeling.

CLMar 9, 2022
Automatic Language Identification for Celtic Texts

Olha Dovbnia, Anna Wróblewska

Language identification is an important Natural Language Processing task. It has been thoroughly researched in the literature. However, some issues are still open. This work addresses the identification of the related low-resource languages on the example of the Celtic language family. This work's main goals were: (1) to collect the dataset of three Celtic languages; (2) to prepare a method to identify the languages from the Celtic family, i.e. to train a successful classification model; (3) to evaluate the influence of different feature extraction methods, and explore the applicability of the unsupervised models as a feature extraction technique; (4) to experiment with the unsupervised feature extraction on a reduced annotated set. We collected a new dataset including Irish, Scottish, Welsh and English records. We tested supervised models such as SVM and neural networks with traditional statistical features alongside the output of clustering, autoencoder, and topic modelling methods. The analysis showed that the unsupervised features could serve as a valuable extension to the n-gram feature vectors. It led to an improvement in performance for more entangled classes. The best model achieved a 98\% F1 score and 97\% MCC. The dense neural network consistently outperformed the SVM model. The low-resource languages are also challenging due to the scarcity of available annotated training data. This work evaluated the performance of the classifiers using the unsupervised feature extraction on the reduced labelled dataset to handle this issue. The results uncovered that the unsupervised feature vectors are more robust to the labelled set reduction. Therefore, they proved to help achieve comparable classification performance with much less labelled data.

CLJan 27Code
On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Michał Gromadzki, Anna Wróblewska, Agnieszka Kaliska

The rapid progress of large language models has enabled the generation of text that closely resembles human writing, creating challenges for authenticity verification in education, publishing, and digital security. Detecting AI-generated text has therefore become a crucial technical and ethical issue. This paper presents a comprehensive study of AI-generated text detection based on large-scale corpora and novel training strategies. We introduce a 1-billion-token corpus of human-authored texts spanning multiple genres and a 1.9-billion-token corpus of AI-generated texts produced by prompting a variety of LLMs across diverse domains. Using these resources, we develop and evaluate numerous detection models and propose two novel training paradigms: Per LLM and Per LLM family fine-tuning. Across a 100-million-token benchmark covering 21 large language models, our best fine-tuned detector achieves up to $99.6\%$ token-level accuracy, substantially outperforming existing open-source baselines.

CLMay 10, 2023Code
Enriching language models with graph-based context information to better understand textual data

Albert Roethel, Maria Ganzha, Anna Wróblewska

A considerable number of texts encountered daily are somehow connected with each other. For example, Wikipedia articles refer to other articles via hyperlinks, scientific papers relate to others via citations or (co)authors, while tweets relate via users that follow each other or reshare content. Hence, a graph-like structure can represent existing connections and be seen as capturing the "context" of the texts. The question thus arises if extracting and integrating such context information into a language model might help facilitate a better automated understanding of the text. In this study, we experimentally demonstrate that incorporating graph-based contextualization into BERT model enhances its performance on an example of a classification task. Specifically, on Pubmed dataset, we observed a reduction in error from 8.51% to 7.96%, while increasing the number of parameters just by 1.6%. Our source code: https://github.com/tryptofanik/gc-bert

CLSep 2, 2022
Entity Graph Extraction from Legal Acts -- a Prototype for a Use Case in Policy Design Analysis

Anna Wróblewska, Bartosz Pieliński, Karolina Seweryn et al.

This paper presents research on a prototype developed to serve the quantitative study of public policy design. This sub-discipline of political science focuses on identifying actors, relations between them, and tools at their disposal in health, environmental, economic, and other policies. Our system aims to automate the process of gathering legal documents, annotating them with Institutional Grammar, and using hypergraphs to analyse inter-relations between crucial entities. Our system is tested against the UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage from 2003, a legal document regulating essential aspects of international relations securing cultural heritage.

HCMay 7
From Surface Learning to Deep Understanding: A Grounded AI Tutoring System for Moodle

Anna Ostrowska, Michał Kukla, Gabriela Majstrak et al.

This demo paper describes the development of the AI Teaching \& Learning Assistant, a modular Moodle plugin that leverages Retrieval-Augmented Generation (RAG) to deliver high-quality, hallucination-free education. The system employs a dual-centric design, providing students with interactive, Socratic-based tutoring and educators with a "human-in-the-loop" workspace for supervised content generation. By grounding Large Language Model (LLM) responses in teacher-provided materials, the assistant addresses the risks of misinformation while encouraging deep conceptual mastery. Evaluation via the Ragas (LLM-as-a-Judge) framework and a preliminary user study confirms its effectiveness, achieving faithfulness scores up to 0.97 and a 4.00/5.00 recommendation rate.

CVJan 31, 2024
Improving Object Detection Quality in Football Through Super-Resolution Techniques

Karolina Seweryn, Gabriel Chęć, Szymon Łukasik et al.

This study explores the potential of super-resolution techniques in enhancing object detection accuracy in football. Given the sport's fast-paced nature and the critical importance of precise object (e.g. ball, player) tracking for both analysis and broadcasting, super-resolution could offer significant improvements. We investigate how advanced image processing through super-resolution impacts the accuracy and reliability of object detection algorithms in processing football match footage. Our methodology involved applying state-of-the-art super-resolution techniques to a diverse set of football match videos from SoccerNet, followed by object detection using Faster R-CNN. The performance of these algorithms, both with and without super-resolution enhancement, was rigorously evaluated in terms of detection accuracy. The results indicate a marked improvement in object detection accuracy when super-resolution preprocessing is applied. The improvement of object detection through the integration of super-resolution techniques yields significant benefits, especially for low-resolution scenarios, with a notable 12\% increase in mean Average Precision (mAP) at an IoU (Intersection over Union) range of 0.50:0.95 for 320x240 size images when increasing the resolution fourfold using RLFN. As the dimensions increase, the magnitude of improvement becomes more subdued; however, a discernible improvement in the quality of detection is consistently evident. Additionally, we discuss the implications of these findings for real-time sports analytics, player tracking, and the overall viewing experience. The study contributes to the growing field of sports technology by demonstrating the practical benefits and limitations of integrating super-resolution techniques in football analytics and broadcasting.

CLJan 3, 2025
Applying Text Mining to Analyze Human Question Asking in Creativity Research

Anna Wróblewska, Marceli Korbin, Yoed N. Kenett et al.

Creativity relates to the ability to generate novel and effective ideas in the areas of interest. How are such creative ideas generated? One possible mechanism that supports creative ideation and is gaining increased empirical attention is by asking questions. Question asking is a likely cognitive mechanism that allows defining problems, facilitating creative problem solving. However, much is unknown about the exact role of questions in creativity. This work presents an attempt to apply text mining methods to measure the cognitive potential of questions, taking into account, among others, (a) question type, (b) question complexity, and (c) the content of the answer. This contribution summarizes the history of question mining as a part of creativity research, along with the natural language processing methods deemed useful or helpful in the study. In addition, a novel approach is proposed, implemented, and applied to five datasets. The experimental results obtained are comprehensively analyzed, suggesting that natural language processing has a role to play in creative research.

CLMay 10, 2025
Evaluating LLM-Generated Q&A Test: a Student-Centered Study

Anna Wróblewska, Bartosz Grabek, Jakub Świstak et al.

This research prepares an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.

AIJun 20, 2024
Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries

Anna Wróblewska, Marcel Witas, Kinga Frańczak et al.

Recently, multiple applications of machine learning have been introduced. They include various possibilities arising when image analysis methods are applied to, broadly understood, video streams. In this context, a novel tool, developed for academic educators to enhance the teaching process by automating, summarizing, and offering prompt feedback on conducting lectures, has been developed. The implemented prototype utilizes machine learning-based techniques to recognise selected didactic and behavioural teachers' features within lecture video recordings. Specifically, users (teachers) can upload their lecture videos, which are preprocessed and analysed using machine learning models. Next, users can view summaries of recognized didactic features through interactive charts and tables. Additionally, stored ML-based prediction results support comparisons between lectures based on their didactic content. In the developed application text-based models trained on lecture transcriptions, with enhancements to the transcription quality, by adopting an automatic speech recognition solution are applied. Furthermore, the system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis.

CLJun 19, 2024
Mining United Nations General Assembly Debates

Mateusz Grzyb, Mateusz Krzyziński, Bartłomiej Sobieski et al.

This project explores the application of Natural Language Processing (NLP) techniques to analyse United Nations General Assembly (UNGA) speeches. Using NLP allows for the efficient processing and analysis of large volumes of textual data, enabling the extraction of semantic patterns, sentiment analysis, and topic modelling. Our goal is to deliver a comprehensive dataset and a tool (interface with descriptive statistics and automatically extracted topics) from which political scientists can derive insights into international relations and have the opportunity to have a nuanced understanding of global diplomatic discourse.

CLMay 26, 2023
Automating the Analysis of Institutional Design in International Agreements

Anna Wróblewska, Bartosz Pieliński, Karolina Seweryn et al.

This paper explores the automatic knowledge extraction of formal institutional design - norms, rules, and actors - from international agreements. The focus was to analyze the relationship between the visibility and centrality of actors in the formal institutional design in regulating critical aspects of cultural heritage relations. The developed tool utilizes techniques such as collecting legal documents, annotating them with Institutional Grammar, and using graph analysis to explore the formal institutional design. The system was tested against the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage.

CLJan 10, 2022
Polish Natural Language Inference and Factivity -- an Expert-based Dataset and Benchmarks

Daniel Ziembicki, Anna Wróblewska, Karolina Seweryn

Despite recent breakthroughs in Machine Learning for Natural Language Processing, the Natural Language Inference (NLI) problems still constitute a challenge. To this purpose we contribute a new dataset that focuses exclusively on the factivity phenomenon; however, our task remains the same as other NLI tasks, i.e. prediction of entailment, contradiction or neutral (ECN). The dataset contains entirely natural language utterances in Polish and gathers 2,432 verb-complement pairs and 309 unique verbs. The dataset is based on the National Corpus of Polish (NKJP) and is a representative sample in regards to frequency of main verbs and other linguistic features (e.g. occurrence of internal negation). We found that transformer BERT-based models working on sentences obtained relatively good results ($\approx89\%$ F1 score). Even though better results were achieved using linguistic features ($\approx91\%$ F1 score), this model requires more human labour (humans in the loop) because features were prepared manually by expert linguists. BERT-based models consuming only the input sentences show that they capture most of the complexity of NLI/factivity. Complex cases in the phenomenon - e.g. cases with entitlement (E) and non-factive verbs - remain an open issue for further research.

CLDec 24, 2021
Spoiler in a Textstack: How Much Can Transformers Help?

Anna Wróblewska, Paweł Rzepiński, Sylwia Sysko-Romańczuk

This paper presents our research regarding spoiler detection in reviews. In this use case, we describe the method of fine-tuning and organizing the available text-based model tasks with the latest deep learning achievements and techniques to interpret the models' results. Until now, spoiler research has been rarely described in the literature. We tested the transfer learning approach and different latest transformer architectures on two open datasets with annotated spoilers (ROC AUC above 81\% on TV Tropes Movies dataset, and Goodreads dataset above 88\%). We also collected data and assembled a new dataset with fine-grained annotations. To that end, we employed interpretability techniques and measures to assess the models' reliability and explain their results.

CLOct 4, 2021
Protagonists' Tagger in Literary Domain -- New Datasets and a Method for Person Entity Linkage

Weronika Łajewska, Anna Wróblewska

Semantic annotation of long texts, such as novels, remains an open challenge in Natural Language Processing (NLP). This research investigates the problem of detecting person entities and assigning them unique identities, i.e., recognizing people (especially main characters) in novels. We prepared a method for person entity linkage (named entity recognition and disambiguation) and new testing datasets. The datasets comprise 1,300 sentences from 13 classic novels of different genres that a novel reader had manually annotated. Our process of identifying literary characters in a text, implemented in protagonistTagger, comprises two stages: (1) named entity recognition (NER) of persons, (2) named entity disambiguation (NED) - matching each recognized person with the literary character's full name, based on approximate text matching. The protagonistTagger achieves both precision and recall of above 83% on the prepared testing sets. Finally, we gathered a corpus of 13 full-text novels tagged with protagonistTagger that comprises more than 35,000 mentions of literary characters.

CLMay 12, 2021
Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts

Tomasz Stanisławek, Filip Graliński, Anna Wróblewska et al.

The relevance of the Key Information Extraction (KIE) task is increasingly important in natural language processing problems. But there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved an 81.77% and an 83.57% F1-score on respectively the Kleister NDA and the Kleister Charity datasets. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.

CLMar 4, 2020
Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

Filip Graliński, Tomasz Stanisławek, Anna Wróblewska et al.

State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages. To encourage progress on deeper and more complex Information Extraction (IE) we introduce a new task (named Kleister) with two new datasets. Utilizing both textual and structural layout features, an NLP system must find the most important information, about various types of entities, in long formal documents. We propose Pipeline method as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa). Moreover, we checked the most popular PDF processing tools for text extraction (pdf2djvu, Tesseract and Textract) in order to analyze behavior of IE system in presence of errors introduced by these tools.

CLOct 6, 2019
Named Entity Recognition -- Is there a glass ceiling?

Tomasz Stanislawek, Anna Wróblewska, Alicja Wójcicka et al.

Recent developments in Named Entity Recognition (NER) have resulted in better and better models. However, is there a glass ceiling? Do we know which types of errors are still hard or even impossible to correct? In this paper, we present a detailed analysis of the types of errors in state-of-the-art machine learning (ML) methods. Our study reveals the weak and strong points of the Stanford, CMU, FLAIR, ELMO and BERT models, as well as their shared limitations. We also introduce new techniques for improving annotation, for training processes and for checking a model's quality and stability. Presented results are based on the CoNLL 2003 data set for the English language. A new enriched semantic annotation of errors for this data set and new diagnostic data sets are attached in the supplementary materials.