Davide Cacciarelli

h-index7

7papers

154citations

Novelty35%

AI Score36

Ranked #101,387 of 194,257 authors (top 52%)#1,436 in ML (top 43%)

7 Papers

22.5MLFeb 17, 2023

Active learning for data streams: a survey

Davide Cacciarelli, Murat Kulahci

Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in the last decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently proposed approaches for selecting the most informative observations from data streams in real time. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research.

11.6MLJul 20, 2022

Stream-based active learning with linear models

Davide Cacciarelli, Murat Kulahci, John Sølve Tyssedal

The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.

8.6MLFeb 1, 2023

Robust online active learning

Davide Cacciarelli, Murat Kulahci, John Sølve Tyssedal

In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.

7.8LGDec 26, 2022

Online Active Learning for Soft Sensor Development using Semi-Supervised Autoencoders

Davide Cacciarelli, Murat Kulahci, John Tyssedal

Data-driven soft sensors are extensively used in industrial and chemical processes to predict hard-to-measure process variables whose real value is difficult to track during routine operations. The regression models used by these sensors often require a large number of labeled examples, yet obtaining the label information can be very expensive given the high time and cost required by quality inspections. In this context, active learning methods can be highly beneficial as they can suggest the most informative labels to query. However, most of the active learning strategies proposed for regression focus on the offline setting. In this work, we adapt some of these approaches to the stream-based scenario and show how they can be used to select the most informative data points. We also demonstrate how to use a semi-supervised architecture based on orthogonal autoencoders to learn salient features in a lower dimensional space. The Tennessee Eastman Process is used to compare the predictive performance of the proposed approaches.

4.1LGNov 4, 2025

Data-driven jet fuel demand forecasting: A case study of Copenhagen Airport

Alessandro Contini, Davide Cacciarelli, Murat Kulahci

Accurate forecasting of jet fuel demand is crucial for optimizing supply chain operations in the aviation market. Fuel distributors specifically require precise estimates to avoid inventory shortages or excesses. However, there is a lack of studies that analyze the jet fuel demand forecasting problem using machine learning models. Instead, many industry practitioners rely on deterministic or expertise-based models. In this research, we evaluate the performance of data-driven approaches using a substantial amount of data obtained from a major aviation fuel distributor in the Danish market. Our analysis compares the predictive capabilities of traditional time series models, Prophet, LSTM sequence-to-sequence neural networks, and hybrid models. A key challenge in developing these models is the required forecasting horizon, as fuel demand needs to be predicted for the next 30 days to optimize sourcing strategies. To ensure the reliability of the data-driven approaches and provide valuable insights to practitioners, we analyze three different datasets. The primary objective of this study is to present a comprehensive case study on jet fuel demand forecasting, demonstrating the advantages of employing data-driven models and highlighting the impact of incorporating additional variables in the predictive models.

1.2APJan 10, 2025Code

Do we actually understand the impact of renewables on electricity prices? A causal inference approach

Davide Cacciarelli, Pierre Pinson, Filip Panagiotopoulos et al.

The energy transition is profoundly reshaping electricity market dynamics. It makes it essential to understand how renewable energy generation actually impacts electricity prices, among all other market drivers. These insights are critical to design policies and market interventions that ensure affordable, reliable, and sustainable energy systems. However, identifying causal effects from observational data is a major challenge, requiring innovative causal inference approaches that go beyond conventional regression analysis only. We build upon the state of the art by developing and applying a local partially linear double machine learning approach. Its application yields the first robust causal evidence on the distinct and non-linear effects of wind and solar power generation on UK wholesale electricity prices, revealing key insights that have eluded previous analyses. We find that, over 2018-2024, wind power generation has a U-shaped effect on prices: at low penetration levels, a 1 GWh increase in energy generation reduces prices by up to 7 GBP/MWh, but this effect gets close to none at mid-penetration levels (20-30%) before intensifying again. Solar power places substantial downward pressure on prices at very low penetration levels (up to 9 GBP/MWh per 1 GWh increase in energy generation), though its impact weakens quite rapidly. We also uncover a critical trend where the price-reducing effects of both wind and solar power have become more pronounced over time (from 2018 to 2024), highlighting their growing influence on electricity markets amid rising penetration. Our study provides both novel analysis approaches and actionable insights to guide policymakers in appraising the way renewables impact electricity markets.

3.7CVApr 16, 2024Code

Camera clustering for scalable stream-based active distillation

Dani Manjah, Davide Cacciarelli, Christophe De Vleeschouwer et al.

We present a scalable framework designed to craft efficient lightweight models for video object detection utilizing self-training and knowledge distillation techniques. We scrutinize methodologies for the ideal selection of training images from video streams and the efficacy of model sharing across numerous cameras. By advocating for a camera clustering methodology, we aim to diminish the requisite number of models for training while augmenting the distillation dataset. The findings affirm that proper camera clustering notably amplifies the accuracy of distilled models, eclipsing the methodologies that employ distinct models for each camera or a universal model trained on the aggregate camera data.