Jukka K. Nurminen

LG
h-index: 28
Papers: 13
Citations: 277
Novelty: 27%
AI Score: 38

13 Papers

LG | Jul 11, 2024
Confidence-based Estimators for Predictive Performance in Model Monitoring

Juhani Kivimäki, Jakub Białek, Jukka K. Nurminen et al.

After a machine learning model has been deployed into production, its predictive performance needs to be monitored. Ideally, such monitoring can be carried out by comparing the model's predictions against ground truth labels. For this to be possible, the ground truth labels must be available relatively soon after inference. However, there are many use cases where ground truth labels are available only after a significant delay, or in the worst case, not at all. In such cases, directly monitoring the model's predictive performance is impossible. Recently, novel methods for estimating the predictive performance of a model when ground truth is unavailable have been developed. Many of these methods leverage model confidence or other uncertainty estimates and are experimentally compared against a naive baseline method, namely Average Confidence (AC), which estimates model accuracy as the average of confidence scores for a given set of predictions. However, until now the theoretical properties of the AC method have not been properly explored. In this paper, we try to fill this gap by reviewing the AC method and showing that, under certain general assumptions, it is an unbiased and consistent estimator of model accuracy with many desirable properties. We also compare this baseline estimator against some more complex estimators empirically and show that in many cases the AC method is able to beat the others, although the comparative quality of the different estimators is heavily case-dependent.
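The AC baseline described in this abstract can be illustrated with a minimal sketch. It assumes each confidence score is the model's predicted probability for its chosen class; the function name `average_confidence` and the sample scores are hypothetical and not taken from the paper.

```python
def average_confidence(confidences):
    """Estimate accuracy as the mean confidence over a set of predictions."""
    if not confidences:
        raise ValueError("need at least one prediction")
    return sum(confidences) / len(confidences)


# Hypothetical confidence scores: the predicted probability of the
# chosen class for four unlabeled production predictions.
scores = [0.9, 0.8, 0.7, 0.6]
estimated_accuracy = average_confidence(scores)
print(f"Estimated accuracy: {estimated_accuracy:.2f}")
```

The estimate is only as good as the confidence scores themselves, which is why the abstract ties the estimator's properties to certain general assumptions.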

QUANT-PH | Sep 20, 2024
The Impact of Feature Embedding Placement in the Ansatz of a Quantum Kernel in QSVMs

Ilmo Salmenperä, Ilmars Kuhtarskis, Arianne Meijer van de Griend et al.

Designing a useful feature map for a quantum kernel is a critical task when attempting to achieve an advantage over classical machine learning models. The choice of circuit architecture, i.e., how feature-dependent gates should be interwoven with other gates, is a relatively unexplored problem and becomes very important when using a model of quantum kernels called Quantum Embedding Kernels (QEK). We study and categorize various architectural patterns in QEKs and show that existing architectural styles do not behave as the literature supposes. We also produce a novel alternative architecture based on the old ones and show that it performs equally well while containing fewer gates than its older counterparts.

7.8 | QUANT-PH | Mar 22
The Average Relative Entropy and Transpilation Depth determines the noise robustness in Variational Quantum Classifiers

Aakash Ravindra Shinde, Arianne Meijer-van de Griend, Jukka K. Nurminen

Variational Quantum Algorithms (VQAs) have been extensively researched for applications in Quantum Machine Learning (QML), optimization, and molecular simulations. Although designed for Noisy Intermediate-Scale Quantum (NISQ) devices, VQAs are predominantly evaluated classically due to uncertain results on noisy devices and limited resource availability, raising concerns over the reproducibility of simulated VQAs on noisy hardware. While prior studies indicate that VQAs may exhibit noise resilience in specific parameterized shallow quantum circuits, there are no definitive measures to establish what defines a shallow circuit or the optimal circuit depth for VQAs on a noisy platform. These challenges extend naturally to Variational Quantum Classification (VQC) algorithms, a subclass of VQAs for supervised learning. In this article, we propose a relative entropy-based metric to verify whether a VQC model would perform similarly on a noisy device as it does on simulations. We establish a strong correlation between the average relative entropy difference in classes, transpilation circuit depth, and their performance difference on a noisy quantum device. Our results further indicate that circuit depth alone is insufficient to characterize shallow circuits. We present empirical evidence to support these assertions across a diverse array of techniques for implementing VQC, datasets, and multiple noisy quantum devices.

QUANT-PH | Nov 5, 2025
Influence of Data Dimensionality Reduction Methods on the Effectiveness of Quantum Machine Learning Models

Aakash Ravindra Shinde, Jukka K. Nurminen

Data dimensionality reduction techniques are often utilized in the implementation of Quantum Machine Learning (QML) models to address two significant issues: the constraints of NISQ quantum devices, which are characterized by noise and a limited number of qubits, and the challenge of simulating a large number of qubits on classical devices. Their use also raises concerns over the scalability of these approaches, as dimensionality reduction methods are slow to adapt to large datasets. In this article, we analyze how data reduction methods affect different QML models. We conduct this experiment over several generated datasets, quantum machine learning algorithms, quantum data encoding methods, and data reduction methods. All these models were evaluated on performance metrics such as accuracy, precision, recall, and F1 score. Our findings have led us to conclude that the usage of data dimensionality reduction methods results in skewed performance metric values and thus in a wrong estimate of the actual performance of quantum machine learning models. Several factors, along with data dimensionality reduction methods, worsen this problem, such as characteristics of the datasets, classical-to-quantum information embedding methods, percentage of feature reduction, classical components associated with quantum models, and structure of quantum machine learning models. We consistently observed accuracy differences in the range of 14% to 48% between models using data reduction and those not using it. Apart from this, our observations have shown that some data reduction methods tend to perform better for some specific data embedding methodologies and ansatz constructions.

CY | Apr 17, 2025
How Large Language Models Are Changing MOOC Essay Answers: A Comparison of Pre- and Post-LLM Responses

Leo Leppänen, Lili Aunimo, Arto Hellas et al.

The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool's ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" as heralding a golden age of information retrieval and computer-assisted learning. Others worry the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT's release. We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe, as expected based on related public discourse, changes in the prevalence of key content words related to AI and LLMs, but not necessarily in the general themes or topics discussed in the student essays as identified through (dynamic) topic modeling.

LG | May 8, 2025
Performance Estimation in Binary Classification Using Calibrated Confidence

Juhani Kivimäki, Jakub Białek, Wojtek Kuberski et al.

Model monitoring is a critical component of the machine learning lifecycle, safeguarding against undetected drops in the model's performance after deployment. Traditionally, performance monitoring has required access to ground truth labels, which are not always readily available. This can result in unacceptable latency or render performance monitoring altogether impossible. Recently, methods designed to estimate the accuracy of classifier models without access to labels have shown promising results. However, there are various other metrics that might be more suitable for assessing model performance in many cases. Until now, none of these important metrics has received similar interest from the scientific community. In this work, we address this gap by presenting CBPE, a novel method that can estimate any binary classification metric defined using the confusion matrix. In particular, we choose four metrics from this large family: accuracy, precision, recall, and F1, to demonstrate our method. CBPE treats the elements of the confusion matrix as random variables and leverages calibrated confidence scores of the model to estimate their distributions. The desired metric is then also treated as a random variable, whose full probability distribution can be derived from the estimated confusion matrix. CBPE is shown to produce estimates that come with strong theoretical guarantees and valid confidence intervals.
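The idea of estimating confusion-matrix metrics from calibrated confidence can be sketched in simplified form. The code below is not the CBPE method itself: it computes only the expected confusion-matrix counts and plugs them into the metric formulas, whereas the abstract describes deriving full probability distributions. It assumes each score is a calibrated probability of the positive class; all names and the sample values are hypothetical.

```python
def expected_confusion_matrix(probs, threshold=0.5):
    """Expected TP, FP, FN, TN given calibrated positive-class probabilities."""
    tp = fp = fn = tn = 0.0
    for p in probs:
        if p >= threshold:      # predicted positive
            tp += p             # chance the true label is positive
            fp += 1.0 - p       # chance the true label is negative
        else:                   # predicted negative
            fn += p
            tn += 1.0 - p
    return tp, fp, fn, tn


def point_estimates(probs, threshold=0.5):
    """Plug-in metric estimates computed from the expected counts."""
    if not probs:
        raise ValueError("need at least one prediction")
    tp, fp, fn, tn = expected_confusion_matrix(probs, threshold)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp > 0 else nan,
        "recall": tp / (tp + fn) if tp + fn > 0 else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn > 0 else nan,
    }


# Hypothetical calibrated scores for four unlabeled predictions.
estimates = point_estimates([0.9, 0.8, 0.3, 0.1])
print(estimates)
```

Because precision, recall, and F1 are ratios, plugging in expected counts is only an approximation of their expected values, which is one reason a full distributional treatment is attractive.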

SE | Sep 16, 2021
On Misbehaviour and Fault Tolerance in Machine Learning Systems

Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö et al.

Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new situations and contexts. At the same time, this adaptability raises uncertainties concerning the run-time product quality or dependability, such as reliability and security, of these systems. Systems can be tested and monitored, but this does not provide protection against faults and failures in adapted ML systems themselves. We studied software designs that aim at introducing fault tolerance in ML systems so that possible problems in ML components of the systems can be avoided. The research was conducted as a case study, and its data was collected through five semi-structured interviews with experienced software architects. We present a conceptualisation of the misbehaviour of ML systems, the perceived role of fault tolerance, and the designs used. Common patterns for incorporating ML components in design in a fault-tolerant fashion have started to emerge. ML models are, for example, guarded by monitoring the inputs and their distribution, and enforcing business rules on acceptable outputs. Multiple, specialised ML models are used to adapt to the variations and changes in the surrounding world, and simpler fall-over techniques like default outputs are put in place to have systems up and running in the face of problems. However, the general role of these patterns is not widely acknowledged. This is mainly due to the relative immaturity of using ML as part of a complete software system: the field still lacks established frameworks and practices beyond training to implement, operate, and maintain the software that utilises ML. ML software engineering needs further analysis and development on all fronts.
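Three of the patterns named in this abstract (input guards, business rules on outputs, and default outputs) can be combined in a small wrapper. This is an illustrative sketch under assumed simple range checks, not a design reported in the paper; `guarded_predict`, `toy_model`, and all bounds are hypothetical.

```python
def guarded_predict(model, features, input_bounds, output_bounds, default):
    """Call model(features), falling back to a default output on any problem."""
    # Input guard: every feature must lie inside its expected range.
    for value, (low, high) in zip(features, input_bounds):
        if not (low <= value <= high):
            return default

    # The model call itself may fail; treat that as a fault as well.
    try:
        prediction = model(features)
    except Exception:
        return default

    # Business rule: only outputs inside the acceptable range are passed on.
    low, high = output_bounds
    if not (low <= prediction <= high):
        return default
    return prediction


def toy_model(features):
    """Hypothetical stand-in for a trained ML model."""
    return 2.0 * features[0] + features[1]


bounds = [(0.0, 10.0), (0.0, 10.0)]
ok = guarded_predict(toy_model, [1.0, 2.0], bounds, (0.0, 20.0), default=-1.0)
bad_input = guarded_predict(toy_model, [100.0, 2.0], bounds, (0.0, 20.0), default=-1.0)
bad_output = guarded_predict(toy_model, [1.0, 2.0], bounds, (0.0, 3.0), default=-1.0)
print(ok, bad_input, bad_output)
```

Returning a fixed default keeps the surrounding system running; a production design would typically also log each fallback so that the monitoring side can see how often the guards fire.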

SE | Jul 26, 2021
Systematic Literature Review of Validation Methods for AI Systems

Lalli Myllyaho, Mikko Raatikainen, Tomi Männistö et al.

Context: Artificial intelligence (AI) has made its way into everyday activities, particularly through new techniques such as machine learning (ML). These techniques are implementable with little domain knowledge. This, combined with the difficulty of testing AI systems with traditional methods, has made system trustworthiness a pressing issue. Objective: This paper studies the methods used to validate practical AI systems reported in the literature. Our goal is to classify and describe the methods that are used in realistic settings to ensure the dependability of AI systems. Method: A systematic literature review resulted in 90 papers. Systems presented in the papers were analysed based on their domain, task, complexity, and applied validation methods. Results: The validation methods were synthesized into a taxonomy consisting of trial, simulation, model-centred validation, and expert opinion. Failure monitors, safety channels, redundancy, voting, and input and output restrictions are methods used to continuously validate the systems after deployment. Conclusions: Our results clarify existing strategies applied to validation. They form a basis for the synthesization, assessment, and refinement of AI system validation in research and guidelines for validating individual systems in practice. While various validation strategies have all been relatively widely applied, only a few studies report on continuous validation. Keywords: artificial intelligence, machine learning, validation, testing, V&V, systematic literature review.

SE | Jul 22, 2020
Validation Frameworks for Self-Driving Vehicles: A Survey

Francesco Concas, Jukka K. Nurminen, Tommi Mikkonen et al.

As a part of the digital transformation, we interact with more and more intelligent gadgets. Today, these gadgets are often mobile devices, but with the advent of smart cities, more and more infrastructure in our surroundings, such as traffic and buildings, becomes intelligent. The intelligence, however, does not emerge by itself. Instead, we need both design techniques to create intelligent systems and approaches to validate their correct behavior. An example of intelligent systems that could benefit smart cities is self-driving vehicles. Self-driving vehicles are continuously becoming both commercially available and common on roads. Accidents involving self-driving vehicles, however, have raised concerns about their reliability. Due to these concerns, the safety of self-driving vehicles should be thoroughly tested before they can be released into traffic. To ensure that self-driving vehicles encounter all possible scenarios, several millions of hours of testing must be carried out; therefore, testing self-driving vehicles in the real world is impractical. There is also the issue that testing self-driving vehicles directly in traffic poses a potential safety hazard to human drivers. To tackle this challenge, validation frameworks for testing self-driving vehicles in simulated scenarios are being developed by academia and industry. In this chapter, we briefly introduce self-driving vehicles and give an overview of validation frameworks for testing them in a simulated environment. We conclude by discussing what an ideal validation framework at the state of the art should be and what could benefit validation frameworks for self-driving vehicles in the future.

LG | May 6, 2020
Testing the Robustness of AutoML Systems

Tuomas Halvari, Jukka K. Nurminen, Tommi Mikkonen

Automated machine learning (AutoML) systems aim at finding the best machine learning (ML) pipeline that automatically matches the task and data at hand. We investigate the robustness of machine learning pipelines generated with three AutoML systems, TPOT, H2O, and AutoKeras. In particular, we study the influence of dirty data on accuracy, and consider how using dirty training data may help create more robust solutions. Furthermore, we also analyze how the structure of the generated pipelines differs in different cases.

MM | Mar 14, 2014
Saving Energy in Mobile Devices for On-Demand Multimedia Streaming -- A Cross-Layer Approach

Mohammad Ashraful Hoque, Matti Siekkinen, Jukka K. Nurminen et al.

This paper proposes a novel energy-efficient multimedia delivery system called EStreamer. First, we study the relationship between buffer size at the client, burst-shaped TCP-based multimedia traffic, and energy consumption of wireless network interfaces in smartphones. Based on the study, we design and implement EStreamer for constant bit rate and rate-adaptive streaming. EStreamer can improve battery lifetime by 3x, 1.5x and 2x while streaming over Wi-Fi, 3G and 4G respectively.

MM | Nov 18, 2013
Mobile Multimedia Streaming Techniques: QoE and Energy Consumption Perspective

Mohammad Ashraful Hoque, Matti Siekkinen, Jukka K. Nurminen et al.

Multimedia streaming to mobile devices is challenging for two reasons. First, the way content is delivered to a client must ensure that the user does not experience a long initial playback delay or a distorted playback in the middle of a streaming session. Second, multimedia streaming applications are among the most energy-hungry applications in smartphones. The energy consumption mostly depends on the delivery techniques and on the power management techniques of wireless access technologies (Wi-Fi, 3G, and 4G). In order to provide insights on what kind of streaming techniques exist, how they work on different mobile platforms, their efforts in providing smooth quality of experience, and their impact on energy consumption of mobile phones, we conducted a large set of active measurements with several smartphones having both Wi-Fi and cellular network access. Our analysis reveals five different techniques to deliver the content to the video players. The selection of a technique depends on the mobile platform, device, player, quality, and service. The results from our traffic and power measurements allow us to conclude that none of the identified techniques is optimal because they take none of the following facts into account: access technology used, user behavior, and user preferences concerning data waste. We point out the technique with optimal playback buffer configuration, which provides the most attractive trade-offs in particular situations.

MM | Sep 13, 2012
Investigating Streaming Techniques and Energy Efficiency of Mobile Video Services

Mohammad Ashraful Hoque, Matti Siekkinen, Jukka K. Nurminen et al.

We report results from a measurement study of three video streaming services, YouTube, Dailymotion, and Vimeo, on six different smartphones. We measure and analyze the traffic and energy consumption when streaming different quality videos over Wi-Fi and 3G. We identify five different techniques to deliver the video and show that the use of a particular technique depends on the device, player, quality, and service. The energy consumption varies dramatically between devices, services, and video qualities depending on the streaming technique used. As a consequence, we come up with suggestions on how to improve the energy efficiency of mobile video streaming services.