2.7LGMay 29
How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection evaluationJoanna Komorniczak
Data streams are nowadays among the most frequently analyzed data structures, with the concept drift posing a major challenge encountered by processing systems. Despite the proposition of numerous solutions to counteract the accuracy degeneration due to concept drift, the scientific community has not yet established a unified framework for evaluating the concept drift detection task. Existing research often relies on classification quality metrics, but these can be affected by multiple factors and may not reliably reflect drift detection quality. In this work, we present an in-depth overview of the relationship between metrics for quantifying drift detection quality and classification performance in synthetic nonstationary data streams. The proposed research studies eight drift detection quality metrics in relation to the classifier's performance across seven synthetic data stream generation tools, additionally considering drift dynamics as a factor. The studies aim to identify the most informative set of drift detection quality metrics and provide a deep understanding of the method's evaluation.
3.7LGMay 28
Open World Autoencoding Drift Detection with Novel Class Recognition in Tabular Non-stationary Data StreamsJoanna Komorniczak
Data stream processing has become a landmark in modern machine learning applications, with concept drifts and novel class appearances posing the primary challenges faced by sophisticated recognition methods. This work proposes an unsupervised concept drift detection method that identifies shifts in known class distributions based on the reconstruction errors of an autoencoder, while also enabling the recognition of novel class samples through density estimation of a proxy representation of samples. Using mirrored autoencoders allows for independent incremental adaptation to changing problem distributions for the two considered tasks, resulting in continuous adjustment to evolving concepts and reliable recognition of unknown samples. Conducted experiments used a diverse set of synthetic tabular data streams, where both concept drifts and the emergence of novelties were observed. The results show that the proposed approach is competitive with current state-of-the-art unsupervised drift detectors and novelty classifiers.
LGJul 14, 2022
problexity -- an open-source Python library for binary classification problem complexity assessmentJoanna Komorniczak, Pawel Ksieniewicz
The classification problem's complexity assessment is an essential element of many topics in the supervised learning domain. It plays a significant role in meta-learning -- becoming the basis for determining meta-attributes or multi-criteria optimization -- allowing the evaluation of the training set resampling without needing to rebuild the recognition model. The tools currently available for the academic community, which would enable the calculation of problem complexity measures, are available only as libraries of the C++ and R languages. This paper describes the software module that allows for the estimation of 22 complexity measures for the Python language -- compatible with the scikit-learn programming interface -- allowing for the implementation of research using them in the most popular programming environment of the machine learning community.
LGMay 16, 2023Code
torchosr -- a PyTorch extension package for Open Set Recognition models evaluation in PythonJoanna Komorniczak, Pawel Ksieniewicz
The article presents the torchosr package - a Python package compatible with PyTorch library - offering tools and methods dedicated to Open Set Recognition in Deep Neural Networks. The package offers two state-of-the-art methods in the field, a set of functions for handling base sets and generation of derived sets for the Open Set Recognition task (where some classes are considered unknown and used only in the testing process) and additional tools to handle datasets and methods. The main goal of the package proposal is to simplify and promote the correct experimental evaluation, where experiments are carried out on a large number of derivative sets with various Openness and class-to-category assignments. The authors hope that state-of-the-art methods available in the package will become a source of a correct and open-source implementation of the relevant solutions in the domain.
9.0LGApr 10
Synthesizing real-world distributions from high-dimensional Gaussian Noise with Fully Connected Neural NetworkJoanna Komorniczak
The use of synthetic data in machine learning applications and research offers many benefits, including performance improvements through data augmentation, privacy preservation of original samples, and reliable method assessment with fully synthetic data. This work proposes a time-efficient synthetic data generation method based on a fully connected neural network and a randomized loss function that transforms a random Gaussian distribution to approximate a target real-world dataset. The experiments conducted on 25 diverse tabular real-world datasets confirm that the proposed solution surpasses the state-of-the-art generative methods and achieves reference MMD scores orders of magnitude faster than modern deep learning solutions. The experiments involved analyzing distributional similarity, assessing the impact on classification quality, and using PCA for dimensionality reduction, which further enhances data privacy and can boost classification quality while reducing time and memory complexity.
LGApr 11, 2024
Unsupervised Concept Drift Detection based on Parallel Activations of Neural NetworkJoanna Komorniczak, Paweł Ksieniewicz
Practical applications of artificial intelligence increasingly often have to deal with the streaming properties of real data, which, considering the time factor, are subject to phenomena such as periodicity and more or less chaotic degeneration - resulting directly in the concept drifts. The modern concept drift detectors almost always assume immediate access to labels, which due to their cost, limited availability and possible delay has been shown to be unrealistic. This work proposes an unsupervised Parallel Activations Drift Detector, utilizing the outputs of an untrained neural network, presenting its key design elements, intuitions about processing properties, and a pool of computer experiments demonstrating its competitiveness with state-of-the-art methods.
LGFeb 9, 2024
Taking Class Imbalance Into Account in Open Set Recognition EvaluationJoanna Komorniczak, Pawel Ksieniewicz
In recent years Deep Neural Network-based systems are not only increasing in popularity but also receive growing user trust. However, due to the closed-world assumption of such systems, they cannot recognize samples from unknown classes and often induce an incorrect label with high confidence. Presented work looks at the evaluation of methods for Open Set Recognition, focusing on the impact of class imbalance, especially in the dichotomy between known and unknown samples. As an outcome of problem analysis, we present a set of guidelines for evaluation of methods in this field.
LGJul 20, 2025
Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic AlgorithmJoanna Komorniczak
The research community continues to seek increasingly more advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.
LGMay 19, 2025
Synthetic Non-stationary Data Streams for Recognition of the UnknownJoanna Komorniczak
The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods should continuously be ready to analyze time-varying data -- hence, they should enable incremental training and respond to concept drifts. An equally important variability typical for non-stationary data stream environments is the emergence of new, previously unknown classes. Often, methods focus on one of these two phenomena -- detection of concept drifts or detection of novel classes -- while both difficulties can be observed in data streams. Additionally, concerning previously unknown observations, the topic of open set of classes has become particularly important in recent years, where the goal of methods is to efficiently classify within known classes and recognize objects outside the model competence. This article presents a strategy for synthetic data stream generation in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors address the task of detecting novelty and concept drifts and demonstrates how the generated data streams can be utilized in the open set recognition task.
LGFeb 7, 2025
Describing Nonstationary Data Streams in Frequency DomainJoanna Komorniczak
Concept drift is among the primary challenges faced by the data stream processing methods. The drift detection strategies, designed to counteract the negative consequences of such changes, often rely on analyzing the problem metafeatures. This work presents the Frequency Filtering Metadescriptor -- a tool for characterizing the data stream that searches for the informative frequency components visible in the sample's feature vector. The frequencies are filtered according to their variance across all available data batches. The presented solution is capable of generating a metadescription of the data stream, separating chunks into groups describing specific concepts on its basis, and visualizing the frequencies in the original spatial domain. The experimental analysis compared the proposed solution with two state-of-the-art strategies and with the PCA baseline in the post-hoc concept identification task. The research is followed by the identification of concepts in the real-world data streams. The generalization in the frequency domain adapted in the proposed solution allows to capture the complex feature dependencies as a reduced number of frequency components, while maintaining the semantic meaning of data.
LGNov 11, 2024
Structuring the Processing Frameworks for Data Stream Evaluation and ApplicationJoanna Komorniczak, Paweł Ksieniewicz, Paweł Zyblewski
The following work addresses the problem of frameworks for data stream processing that can be used to evaluate the solutions in an environment that resembles real-world applications. The definition of structured frameworks stems from a need to reliably evaluate the data stream classification methods, considering the constraints of delayed and limited label access. The current experimental evaluation often boundlessly exploits the assumption of their complete and immediate access to monitor the recognition quality and to adapt the methods to the changing concepts. The problem is leveraged by reviewing currently described methods and techniques for data stream processing and verifying their outcomes in simulated environment. The effect of the work is a proposed taxonomy of data stream processing frameworks, showing the linkage between drift detection and classification methods considering a natural phenomenon of label delay.