Maria Ganzha

CL
h-index: 42
19 papers
63 citations
Novelty: 29%
AI Score: 44

19 Papers

CL Jul 2, 2024
Fake News Detection: It's All in the Data!

Soveatin Kuntur, Anna Wróblewska, Marcin Paprzycki et al.

This comprehensive survey serves as an indispensable resource for researchers embarking on the journey of fake news detection. By highlighting the pivotal role of dataset quality and diversity, it underscores the significance of these elements in the effectiveness and robustness of detection models. The survey meticulously outlines the key features of datasets, various labeling systems employed, and prevalent biases that can impact model performance. Additionally, it addresses critical ethical issues and best practices, offering a thorough overview of the current state of available datasets. Our contribution to this field is further enriched by the provision of a GitHub repository, which consolidates publicly accessible datasets into a single, user-friendly portal. This repository is designed to facilitate and stimulate further research and development efforts aimed at combating the pervasive issue of fake news.

LG Jul 8, 2022
StatMix: Data augmentation method that relies on image statistics in federated learning

Dominik Lewy, Jacek Mańdziuk, Maria Ganzha et al.

The availability of large amounts of annotated data is one of the pillars of deep learning success. Although numerous big datasets have been made available for research, this is often not the case in real-life applications (e.g. companies are not able to share data due to GDPR or concerns related to intellectual property rights protection). Federated learning (FL) is a potential solution to this problem, as it enables training a global model on data scattered across multiple nodes, without sharing the local data itself. However, even FL methods pose a threat to data privacy, if not handled properly. Therefore, we propose StatMix, an augmentation approach that uses image statistics to improve the results of FL scenarios. StatMix is empirically tested on CIFAR-10 and CIFAR-100, using two neural network architectures. In all FL experiments, application of StatMix improves the average accuracy, compared to the baseline training (with no use of StatMix). Some improvement can also be observed in non-FL setups.
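The abstract does not spell out how the image statistics are used, so the following is only a minimal sketch of one plausible reading: a node standardizes a color channel with its own mean and standard deviation and then rescales it to statistics received from elsewhere. The function names, the flat-list image representation, and the example values are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def channel_stats(pixels):
    """Return (mean, std) of one color channel given as a flat list of floats."""
    return mean(pixels), pstdev(pixels)


def restyle(pixels, target_mean, target_std, eps=1e-8):
    """Standardize a channel with its own statistics, then rescale to the target ones.

    Assumption: this mean/std swap stands in for the statistics-based
    augmentation; the actual StatMix procedure may differ.
    """
    src_mean, src_std = channel_stats(pixels)
    return [
        (p - src_mean) / (src_std + eps) * target_std + target_mean
        for p in pixels
    ]


# Hypothetical values: one local channel and statistics shared by another node.
local_channel = [0.1, 0.2, 0.3, 0.4]
shared_mean, shared_std = 0.6, 0.05
augmented = restyle(local_channel, shared_mean, shared_std)
```

Only the two summary numbers per channel cross node boundaries in this sketch; the pixel values themselves stay local.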

DB Nov 24, 2023
RDF Stream Taxonomy: Systematizing RDF Stream Types in Research and Practice

Piotr Sowinski, Pawel Szmeja, Maria Ganzha et al.

Over the years, RDF streaming was explored in research and practice from many angles, resulting in a wide range of RDF stream definitions. This variety presents a major challenge in discussing and integrating streaming systems, due to the lack of a common language. This work attempts to address this critical research gap, by systematizing RDF stream types present in the literature in a novel taxonomy. The proposed RDF Stream Taxonomy (RDF-STaX) is embodied in an OWL 2 DL ontology that follows the FAIR principles, making it readily applicable in practice. Extensive documentation and additional resources are provided, to foster the adoption of the ontology. Three use cases for the ontology are presented with accompanying competency questions, demonstrating the usefulness of the resource. Additionally, this work introduces a novel nanopublications dataset, which serves as a collaborative, living state-of-the-art review of RDF streaming. The results of a multifaceted evaluation of the resource are presented, testing its logical validity, use case coverage, and adherence to the community's best practices, while also comparing it to other works. RDF-STaX is expected to help drive innovation in RDF streaming, by fostering scientific discussion, cooperation, and tool interoperability.

CV Jan 4, 2023
Towards Edge-Cloud Architectures for Personal Protective Equipment Detection

Jaroslaw Legierski, Kajetan Rachwal, Piotr Sowinski et al.

Detecting Personal Protective Equipment in images and video streams is a relevant problem in ensuring the safety of construction workers. In this contribution, an architecture enabling live image recognition of such equipment is proposed. The solution is deployable in two settings -- edge-cloud and edge-only. The system was tested on an active construction site, as a part of a larger scenario, within the scope of the ASSIST-IoT H2020 project. To determine the feasibility of the edge-only variant, a model for counting people wearing safety helmets was developed using the YOLOX method. It was found that an edge-only deployment is possible for this use case, given the hardware infrastructure available on site. In the preliminary evaluation, several important observations were made that are crucial to the further development and deployment of the system. Future work will include an in-depth investigation of performance aspects of the two architecture variants.

SE May 5, 2022
Ontology Reuse: the Real Test of Ontological Design

Piotr Sowinski, Katarzyna Wasielewska-Michniewska, Maria Ganzha et al.

Reusing ontologies in practice is still very challenging, especially when multiple ontologies are (jointly) involved. Moreover, despite recent advances, the realization of systematic ontology quality assurance remains a difficult problem. In this work, the quality of thirty biomedical ontologies and the Computer Science Ontology is investigated, from the perspective of a practical use case. Special scrutiny is given to cross-ontology references, which are vital for combining ontologies. Diverse methods to detect potential issues are proposed, including natural language processing and network analysis. Moreover, several suggestions for improving ontologies and their quality assurance processes are presented. It is argued that while the advancing automatic tools for ontology quality assurance are crucial for ontology improvement, they will not solve the problem entirely. It is ontology reuse that is the ultimate method for continuously verifying and improving ontology quality, as well as for guiding its future development. Specifically, multiple issues can be found and fixed primarily through practical and diverse ontology reuse scenarios.

LG Apr 9, 2022
Applying machine learning to predict behavior of bus transport in Warsaw, Poland

Łukasz Pałys, Maria Ganzha, Marcin Paprzycki

Nowadays, it is possible to collect precise data describing movements of public transport. Specifically, for each bus (or tram), geoposition data can be regularly collected. This includes data for all buses in Warsaw, Poland. Moreover, this data can be downloaded and analyzed. In this context, one of the simplest questions is: can a model be built to represent the behavior of buses and predict their delays? This work provides initial results of our attempt to answer this question.

LG Jul 15, 2022
Introducing Federated Learning into Internet of Things ecosystems -- preliminary considerations

Karolina Bogacka, Katarzyna Wasielewska-Michniewska, Marcin Paprzycki et al.

Federated learning (FL) was proposed to facilitate the training of models in a distributed environment. It supports the protection of (local) data privacy and uses local resources for model training. Until now, the majority of research has been devoted to "core issues", such as adaptation of machine learning algorithms to FL, data privacy protection, or dealing with the effects of uneven data distribution between clients. This contribution is anchored in a practical use case, where FL is to be actually deployed within an Internet of Things ecosystem. Hence, somewhat different issues that need to be considered, beyond popular considerations found in the literature, are identified. Moreover, an architecture that enables the building of flexible, and adaptable, FL solutions is introduced.

LG Jun 16, 2022
Using adversarial images to improve outcomes of federated learning for non-IID data

Anastasiya Danilenka, Maria Ganzha, Marcin Paprzycki et al.

One of the important problems in federated learning is how to deal with unbalanced data. This contribution introduces a novel technique designed to deal with label-skewed non-IID data, using adversarial inputs created by the I-FGSM method. Adversarial inputs guide the training process and allow the Weighted Federated Averaging to give more importance to clients with 'selected' local label distributions. Experimental results, gathered from image classification tasks for the MNIST and CIFAR-10 datasets, are reported and analyzed.
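The aggregation step mentioned above can be illustrated with a minimal sketch of weighted averaging of client parameters. How the paper derives the per-client weights from adversarial inputs is not stated in the abstract, so the weights below are hypothetical placeholders, and `weighted_average` is an illustrative name rather than the authors' code.

```python
def weighted_average(client_params, client_weights):
    """Combine client parameter vectors into one global vector.

    client_params: list of equally long lists of floats, one per client.
    client_weights: non-negative importance of each client (hypothetical here).
    """
    total = sum(client_weights)
    if total <= 0:
        raise ValueError("client weights must sum to a positive value")
    n_params = len(client_params[0])
    return [
        sum(w * params[i] for params, w in zip(client_params, client_weights)) / total
        for i in range(n_params)
    ]


# Three clients with two parameters each; the third client counts double.
params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [1.0, 1.0, 2.0]
global_params = weighted_average(params, weights)
print(global_params)  # [3.5, 4.5]
```

With equal weights this reduces to a plain average of the client parameters; unequal weights shift the global model toward the favored clients.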

LG Mar 29, 2022
Practical Aspects of Zero-Shot Learning

Elie Saad, Marcin Paprzycki, Maria Ganzha

One of the important areas of machine learning research is zero-shot learning. It is applied when a properly labeled training data set is not available. A number of zero-shot algorithms have been proposed and experimented with. However, none of them seems to be the "overall winner". In situations like this, it may be possible to develop a meta-classifier that would combine the "best aspects" of individual classifiers and outperform all of them. In this context, the goal of this contribution is twofold. First, multiple state-of-the-art zero-shot learning methods are compared for standard benchmark datasets. Second, multiple meta-classifiers are suggested and experimentally compared (for the same datasets).
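The abstract does not say which meta-classifiers were tried. As a minimal sketch of the general idea, the snippet below combines the outputs of several base classifiers by majority voting, which is one simple option and is assumed here purely for illustration; the labels and predictions are made up.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label list.

    predictions: list of lists, one inner list per base classifier,
    all of the same length (one label per sample).
    """
    combined = []
    for labels in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on three samples.
base_predictions = [
    ["cat", "dog", "bird"],
    ["cat", "cat", "bird"],
    ["dog", "dog", "bird"],
]
voted = majority_vote(base_predictions)
print(voted)  # ['cat', 'dog', 'bird']
```

Voting ignores how confident each base classifier is; a trained meta-classifier could instead learn which base method to trust for which inputs.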

CL May 10, 2023 Code
Enriching language models with graph-based context information to better understand textual data

Albert Roethel, Maria Ganzha, Anna Wróblewska

A considerable number of texts encountered daily are somehow connected with each other. For example, Wikipedia articles refer to other articles via hyperlinks, scientific papers relate to others via citations or (co)authors, while tweets relate via users that follow each other or reshare content. Hence, a graph-like structure can represent existing connections and be seen as capturing the "context" of the texts. The question thus arises if extracting and integrating such context information into a language model might help facilitate a better automated understanding of the text. In this study, we experimentally demonstrate that incorporating graph-based contextualization into a BERT model enhances its performance on an example classification task. Specifically, on the PubMed dataset, we observed a reduction in error from 8.51% to 7.96%, while increasing the number of parameters just by 1.6%. Our source code: https://github.com/tryptofanik/gc-bert
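To put the reported numbers in perspective, the short calculation below derives the absolute and relative error reduction from the two error rates quoted in the abstract. Only the 8.51% and 7.96% figures come from the source; the variable names are illustrative.

```python
# Error rates reported in the abstract, in percent.
baseline_error = 8.51
graph_context_error = 7.96

# Absolute reduction in percentage points.
absolute_reduction = baseline_error - graph_context_error

# Relative reduction with respect to the baseline error.
relative_reduction = absolute_reduction / baseline_error

print(f"absolute: {absolute_reduction:.2f} percentage points")
print(f"relative: {relative_reduction * 100:.2f}% of the baseline error")
```

In other words, a drop of 0.55 percentage points removes roughly 6.5% of the baseline errors.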

IR Oct 5, 2021 Code
Exploring usability of Reddit in data science and knowledge processing

Jan Sawicki, Maria Ganzha, Marcin Paprzycki et al.

This contribution argues that Reddit, as a massive, categorized, open-access dataset, is a useful data source for "almost any topic". Hence, it can be used in data science, e.g. for knowledge exploration. This statement is backed up by the presented analysis, based on 180 manually annotated papers related to Reddit itself, and on data acquired from popular databases of scientific papers. Finally, an open source tool is introduced, which provides easy access to Reddit resources and an exploratory data analysis of how Reddit covers selected topics. These functions can be used as a preliminary analysis preceding a broader exploration of Reddit's applicability.

CL Jan 3, 2025
Applying Text Mining to Analyze Human Question Asking in Creativity Research

Anna Wróblewska, Marceli Korbin, Yoed N. Kenett et al.

Creativity relates to the ability to generate novel and effective ideas in the areas of interest. How are such creative ideas generated? One possible mechanism that supports creative ideation and is gaining increased empirical attention is by asking questions. Question asking is a likely cognitive mechanism that allows defining problems, facilitating creative problem solving. However, much is unknown about the exact role of questions in creativity. This work presents an attempt to apply text mining methods to measure the cognitive potential of questions, taking into account, among others, (a) question type, (b) question complexity, and (c) the content of the answer. This contribution summarizes the history of question mining as a part of creativity research, along with the natural language processing methods deemed useful or helpful in the study. In addition, a novel approach is proposed, implemented, and applied to five datasets. The experimental results obtained are comprehensively analyzed, suggesting that natural language processing has a role to play in creative research.

CL Apr 9
Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

Soveatin Kuntur, Maciej Krzywda, Anna Wróblewska et al.

The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.
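The F1 score used as the performance measure is the harmonic mean of precision and recall. The sketch below computes it for a binary labeling; the abstract does not state whether binary, macro, or weighted F1 was reported, so the binary form and the toy labels are assumptions for illustration.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = misinformation, 0 = legitimate (made-up data).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_score(y_true, y_pred)
print(score)  # 0.75
```

Here three true positives, one false positive, and one false negative give precision and recall of 0.75 each, hence an F1 of 0.75.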

CL Apr 9
Clickbait detection: quick inference with maximum impact

Soveatin Kuntur, Panggih Kusuma Ningrum, Anna Wróblewska et al.

We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC-AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.

MA Sep 2, 2025
Contemporary Agent Technology: LLM-Driven Advancements vs Classic Multi-Agent Systems

Costin Bădică, Amelia Bădică, Maria Ganzha et al.

This contribution provides our comprehensive reflection on the contemporary agent technology, with a particular focus on the advancements driven by Large Language Models (LLM) vs classic Multi-Agent Systems (MAS). It delves into the models, approaches, and characteristics that define these new systems. The paper emphasizes the critical analysis of how the recent developments relate to the foundational MAS, as articulated in the core academic literature. Finally, it identifies key challenges and promising future directions in this rapidly evolving domain.

LG Feb 15, 2025
ReReLRP -- Remembering and Recognizing Tasks with LRP

Karolina Bogacka, Maximilian Höfler, Maria Ganzha et al.

Deep neural networks have revolutionized numerous research fields and applications. Despite their widespread success, a fundamental limitation known as catastrophic forgetting remains, where models fail to retain their ability to perform previously learned tasks after being trained on new ones. This limitation is particularly acute in certain continual learning scenarios, where models must integrate the knowledge from new domains with their existing capabilities. Traditional approaches to mitigate this problem typically rely on memory replay mechanisms, storing either original data samples, prototypes, or activation patterns. Although effective, these methods often introduce significant computational overhead, raise privacy concerns, and require the use of dedicated architectures. In this work we present ReReLRP (Remembering and Recognizing with LRP), a novel solution that leverages Layerwise Relevance Propagation (LRP) to preserve information across tasks. Our contribution provides increased privacy of existing replay-free methods while additionally offering built-in explainability, flexibility of model architecture and deployment, and a new mechanism to increase memory storage efficiency. We validate our approach on a wide variety of datasets, demonstrating results comparable with a well-known replay-based method in selected scenarios.
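For readers unfamiliar with Layerwise Relevance Propagation, the following minimal sketch applies the commonly used epsilon rule to a single bias-free linear layer. It illustrates LRP itself, not the ReReLRP method, whose use of relevance scores is not detailed in the abstract; the function name and the toy weights are assumptions.

```python
def lrp_epsilon_linear(activations, weights, relevance_out, eps=1e-9):
    """Redistribute output relevance to the inputs of a bias-free linear layer.

    activations: input activations a_i.
    weights: weights[i][j] connects input i to output j.
    relevance_out: relevance R_j assigned to each output.
    Epsilon rule: R_i = sum_j a_i * w_ij / (z_j + eps * sign(z_j)) * R_j.
    """
    n_in, n_out = len(activations), len(relevance_out)
    # Pre-activations z_j = sum_i a_i * w_ij.
    z = [
        sum(activations[i] * weights[i][j] for i in range(n_in))
        for j in range(n_out)
    ]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        # The stabilizer keeps the division well defined when z_j is near zero.
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            relevance_in[i] += activations[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


# Toy layer: two inputs, two outputs, each output carrying relevance 1.0.
a = [1.0, 2.0]
w = [[0.5, -1.0], [0.25, 1.0]]
r_in = lrp_epsilon_linear(a, w, [1.0, 1.0])
```

With a small epsilon and no bias, the input relevances sum to approximately the total output relevance, which is the conservation property that makes the scores interpretable as a redistribution.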

MA Nov 25, 2025
EnergyTwin: A Multi-Agent System for Simulating and Coordinating Energy Microgrids

Jakub Muszyński, Ignacy Walużenicz, Patryk Zan et al.

Microgrids are deployed to reduce purchased grid energy, limit exposure to volatile tariffs, and ensure service continuity during disturbances. This requires coordinating heterogeneous distributed energy resources across multiple time scales and under variable conditions. Among existing tools, power-system simulators typically capture physical behaviour but assume centralized control, while multi-agent frameworks model decentralized decision-making but represent energy with no physical grounding. In this context, EnergyTwin is introduced: an agent-based microgrid simulation environment that couples physically grounded models with forecast-informed, rolling-horizon planning and negotiations. Each asset is modeled as an agent, interacting with a central agent that obtains forecasts, formulates predictions, and allocates energy through contract-based interactions. EnergyTwin targets tertiary-layer decision making and is extensible for digital-twin use. Its feasibility was evaluated in a university campus microgrid scenario where multiple planning strategies were compared. The results show that forecast-driven rolling-horizon planning increases local energy self-sufficiency, maintains higher battery reserves, and reduces exposure to low-resilience operating states. They also demonstrate the potential of EnergyTwin as a platform supporting research on resilient, negotiation-driven microgrids.

CL Jan 2, 2022
Topical Classification of Food Safety Publications with a Knowledge Base

Piotr Sowinski, Katarzyna Wasielewska-Michniewska, Maria Ganzha et al.

The vast body of scientific publications presents an increasing challenge of finding those that are relevant to a given research question, and making informed decisions on their basis. This becomes extremely difficult without the use of automated tools. Here, one possible area for improvement is automatic classification of publication abstracts according to their topic. This work introduces a novel, knowledge base-oriented publication classifier. The proposed method focuses on achieving scalability and easy adaptability to other domains. Classification speed and accuracy are shown to be satisfactory, in the very demanding field of food safety. Further development and evaluation of the method is needed, as the proposed approach shows much potential.

CR Jan 8, 2021
Semantic Access Control for Privacy Management of Personal Sensing in Smart Cities

Michał Drozdowicz, Maria Ganzha, Marcin Paprzycki

Personal and home sensors generate valuable information that could be used in Smart Cities. Unfortunately, this data is typically locked away and used only by the application/system developer. While vendors are to blame, one should also consider the "binary nature" of data access. Specifically, either the owner has full control over her data (e.g. in a "closed system"), or she completely loses control when the data is "opened". In this context, we propose a semantic technologies-based authorization and privacy control framework that enables the user to maintain flexible, yet manageable, data access control policies. The proposed approach is described in detail, including implementation and testing.