Jaakko Peltonen

h-index25

6papers

19citations

Novelty45%

AI Score42

Ranked #62,382 of 194,257 authors (top 32%)#14,085 in LG (top 35%)

6 Papers

2.0LGAug 31, 2023

Forecasting Emergency Department Crowding with Advanced Machine Learning Models and Multivariable Input

Jalmari Tuominen, Eetu Pulkkinen, Jaakko Peltonen et al.

Emergency department (ED) crowding is a significant threat to patient safety and it has been repeatedly associated with increased mortality. Forecasting future service demand has the potential patient outcomes. Despite active research on the subject, several gaps remain: 1) proposed forecasting models have become outdated due to quick influx of advanced machine learning models (ML), 2) amount of multivariable input data has been limited and 3) discrete performance metrics have been rarely reported. In this study, we document the performance of a set of advanced ML models in forecasting ED occupancy 24 hours ahead. We use electronic health record data from a large, combined ED with an extensive set of explanatory variables, including the availability of beds in catchment area hospitals, traffic data from local observation stations, weather variables, etc. We show that N-BEATS and LightGBM outpeform benchmarks with 11 % and 9 % respective improvements and that DeepAR predicts next day crowding with an AUC of 0.76 (95 % CI 0.69-0.84). To the best of our knowledge, this is the first study to document the superiority of LightGBM and N-BEATS over statistical benchmarks in the context of ED forecasting.

4.4AIMay 18

When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

MKA Ariyaratne, Azwirman Gusrialdi, Yury Nikulin et al.

This work presents a novel variant of the Firefly Algorithm (FA) for data clustering, addressing limitations of traditional methods like K-Means that struggle with non-uniform cluster shapes, densities, and the need for pre-defining the number of clusters. The proposed algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. Application to robotic sensor networks highlights its practical value, with experiments showing improved clustering quality and reduced intra-cluster path distances compared to K-Means. These results confirm the algorithm's robustness in complex spatial clustering tasks, with potential for future extensions to higher-dimensional and adaptive scenarios.

9.4LGMay 22, 2025Code

Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics

Seyedeh Fatemeh Ebrahimi, Jaakko Peltonen

Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. While some existing methods can incorporate domain knowledge, such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify their division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity, normalized mutual information, and also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.

1.2CGSep 2, 2016

Peacock Bundles: Bundle Coloring for Graphs with Globality-Locality Trade-off

Jaakko Peltonen, Ziyuan Lin

Bundling of graph edges (node-to-node connections) is a common technique to enhance visibility of overall trends in the edge structure of a large graph layout, and a large variety of bundling algorithms have been proposed. However, with strong bundling, it becomes hard to identify origins and destinations of individual edges. We propose a solution: we optimize edge coloring to differentiate bundled edges. We quantify strength of bundling in a flexible pairwise fashion between edges, and among bundled edges, we quantify how dissimilar their colors should be by dissimilarity of their origins and destinations. We solve the resulting nonlinear optimization, which is also interpretable as a novel dimensionality reduction task. In large graphs the necessary compromise is whether to differentiate colors sharply between locally occurring strongly bundled edges ("local bundles"), or also between the weakly bundled edges occurring globally over the graph ("global bundles"); we allow a user-set global-local tradeoff. We call the technique "peacock bundles". Experiments show the coloring clearly enhances comprehensibility of graph layouts with edge bundling.

1.5MLNov 19, 2015

An Information Retrieval Approach to Finding Dependent Subspaces of Multiple Views

Ziyuan Lin, Jaakko Peltonen

Finding relationships between multiple views of data is essential both for exploratory analysis and as pre-processing for predictive tasks. A prominent approach is to apply variants of Canonical Correlation Analysis (CCA), a classical method seeking correlated components between views. The basic CCA is restricted to maximizing a simple dependency criterion, correlation, measured directly between data coordinates. We introduce a new method that finds dependent subspaces of views directly optimized for the data analysis task of \textit{neighbor retrieval between multiple views}. We optimize mappings for each view such as linear transformations to maximize cross-view similarity between neighborhoods of data samples. The criterion arises directly from the well-defined retrieval task, detects nonlinear and local similarities, is able to measure dependency of data relationships rather than only individual data coordinates, and is related to well understood measures of information retrieval quality. In experiments we show the proposed method outperforms alternatives in preserving cross-view neighborhood similarities, and yields insights into local dependencies between multiple views.

2.3QMApr 1, 2014

Toward computational cumulative biology by combining models of biological datasets

Ali Faisal, Jaakko Peltonen, Elisabeth Georgii et al.

A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to both include biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer and the model-based search was more accurate than keyword search; it moreover recovered biologically meaningful relationships that are not straightforwardly visible from annotations, for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.