Filip De Turck

LG
h-index: 54
12 papers
1,233 citations
Novelty: 50%
AI Score: 54

12 Papers

CR | Jan 13 | Code
ConCap: Practical Network Traffic Generation for (ML- and) Flow-based Intrusion Detection Systems

Miel Verkerken, Laurens D'hooge, Bruno Volckaert et al.

Network Intrusion Detection Systems (NIDS) have been studied in research for almost four decades. Yet, despite thousands of papers claiming scientific advances, a non-negligible number of recent works suggest that the findings of prior literature may be questionable. At the root of such a disagreement is the well-known challenge of obtaining data representative of a real-world network -- and, hence, usable for security assessments. We tackle such a challenge in this paper. We propose ConCap, a practical tool meant to facilitate experimental research on NIDS. Through ConCap, a researcher can set up an isolated and lightweight network environment and configure it to produce network-related data, such as packets or NetFlows, that are automatically labeled -- hence ready for fine-grained experiments. ConCap is built on open-source software and is designed to foster experimental reproducibility across the scientific community by sharing just one configuration file. Through comprehensive experiments on 10 different network activities, further expanded via in-depth analyses of 21 variants of two specific activities and of 100 repetitions of four other ones, we empirically verify that ConCap produces network data resembling that of a real-world network. We also carry out experiments on well-known benchmark datasets as well as on a real "smart-home" network, showing that, from a cyber-detection viewpoint, ConCap's automatically-labeled NetFlows are functionally equivalent to those collected in other environments. Finally, we show that ConCap makes it possible to safely reproduce sophisticated attack chains (e.g., to test/enhance existing NIDS). Altogether, ConCap is a solution to the "data problem" that is plaguing NIDS research.

HC | Mar 28
Personalization in Serious Games and Gamification for Healthcare: A Three-Tiered Review of Models, Methods and Opportunities

Stéphanie Carlier, Femke De Backere, Filip De Turck

Serious games and gamification (SGG) have been shown to have positive effects on health outcomes of eHealth applications. However, research has shown that a shift towards a personalized approach is needed, considering the diversity of users. This introduces new challenges to the domain of SGG as research is needed on how such personalization is achieved. A literature search was conducted to provide an overview of personalization strategies. In total, 50 articles were identified: 35 reported on a serious game and 15 focused on gamification. We introduce a three-tiered classification model, including a model level, a personalization paradigm level, and an algorithmic framework level, to synthesize how personalization is implemented. Data-driven approaches are most common overall (22/50), with knowledge-driven and hybrid methods more prevalent in rehabilitation, reflecting safety and explainability requirements. Popular modeling choices include Hexad-based player modeling and ontologies for expert knowledge integration. Despite encouraging results, reusability remains limited, impeding comparison and knowledge transfer. This review outlines opportunities for progress: shareable knowledge assets, swap-friendly personalization engines, and clinically bounded hybrid approaches, alongside cautious use of generative AI to accelerate design while maintaining safety and explainability. This classification framework and synthesis aim to guide more modular, comparable, and clinically aligned personalized SGG.

DC | Dec 1, 2025
Delta Sum Learning: an approach for fast and global convergence in Gossip Learning

Tom Goethals, Merlijn Sebrechts, Stijn De Schrijver et al.

Federated Learning is a popular approach for distributed learning due to its security and computational benefits. With the advent of powerful devices at the network edge, Gossip Learning further decentralizes Federated Learning by removing centralized integration and relying fully on peer-to-peer updates. However, the averaging methods generally used in both Federated and Gossip Learning are not ideal for model accuracy and global convergence. Additionally, there are few options to deploy learning workloads at the edge as part of a larger application using a declarative approach such as Kubernetes manifests. This paper proposes Delta Sum Learning as a method to improve the basic aggregation operation in Gossip Learning, and implements it in a decentralized orchestration framework based on Open Application Model, which allows for dynamic node discovery and intent-driven deployment of multi-workload applications. Evaluation results show that Delta Sum performance is on par with alternative integration methods for 10-node topologies, but results in a 58% lower global accuracy drop when scaling to 50 nodes. Overall, it shows strong global convergence and a logarithmic loss of accuracy with increasing topology size compared to a linear loss for alternatives under limited connectivity.

CR | May 25
"What is the Problem Space?" Defining Host-space Adversarial Perturbations against Network Intrusion Detection Systems

Miel Verkerken, Laurens D'hooge, Bruno Volckaert et al.

Network Intrusion Detection Systems (NIDS) are now increasingly leveraging Machine Learning (ML) techniques to detect malicious network activities. Numerous papers have scrutinized the security of ML-based NIDS (ML-NIDS) by testing them against various attacks involving adversarial perturbations. The findings were oftentimes worrying: by making imperceptible changes to a given input, powerful ML models would be bypassed. In this context, we took a step back and wondered: where (i.e., in what "space") have these perturbations been applied? We argue that real-world adversaries can apply adversarial perturbations only by operating on the hosts they can control -- a concept which we define as _host-space perturbations_. To some, such an observation may seem trivial. And yet, through a systematic literature review (n=316), we found that prior work applied perturbations by manipulating pre-collected datapoints (e.g., a packet _captured by the router_, or a network flow _analysed by the ML-NIDS_). Such operations, while not impossible, may be outside the reach of an attacker who can only control some (unprivileged) hosts in a network. Hence, to demonstrate how to craft host-space perturbations and study some of their effects, we experimented on well-known benchmarks and a real-world network. We show that ML-NIDS that can detect the SSH-bruteforcing attempts launched via a given command string cannot detect any attempt launched by changing _a single character_ of such a string. We then examined how such a minuscule change in the "problem space" (i.e., the attacker's host) can lead to devastating effects on the "feature space". We derive lessons learned on how to practically assess host-space perturbations. Our stance is that the security of ML-NIDS should be re-assessed.

LG | Sep 9, 2020
Walk Extraction Strategies for Node Embeddings with RDF2Vec in Knowledge Graphs

Gilles Vandewiele, Bram Steenwinckel, Pieter Bonte et al.

As knowledge graphs (KGs) are symbolic constructs, specialized techniques have to be applied in order to make them compatible with data mining techniques. RDF2Vec is an unsupervised technique that can create task-agnostic numerical representations of the nodes in a KG by extending successful language modelling techniques. The original work proposed the Weisfeiler-Lehman (WL) kernel to improve the quality of the representations. However, in this work, we show both formally and empirically that the WL kernel does little to improve walk embeddings in the context of a single KG. As an alternative to the WL kernel, we propose five different strategies to extract information complementary to basic random walks. We compare these walks on several benchmark datasets to show that the n-gram strategy performs best on average on node classification tasks and that tuning the walk strategy can result in improved predictive performance.
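
The "basic random walks" that the abstract uses as its baseline can be illustrated with a minimal sketch. It assumes the graph is stored as a dictionary from a subject to its (predicate, object) pairs; the function name `extract_random_walks`, the toy graph, and all parameters are hypothetical and not taken from the paper. It does not implement the five proposed strategies or the WL kernel.

```python
import random

def extract_random_walks(graph, root, depth, n_walks, seed=0):
    """Return n_walks random walks of at most `depth` hops starting at `root`.

    `graph` maps a subject to a list of (predicate, object) pairs.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk = [root]
        node = root
        for _ in range(depth):
            edges = graph.get(node, [])
            if not edges:  # dead end: stop this walk early
                break
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(tuple(walk))
    return walks

# Toy knowledge graph (hypothetical data, for illustration only).
toy_kg = {
    "alice": [("knows", "bob"), ("worksAt", "ugent")],
    "bob": [("worksAt", "ugent")],
    "ugent": [("locatedIn", "ghent")],
}

walks = extract_random_walks(toy_kg, "alice", depth=2, n_walks=5)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk alternates nodes and predicates, so it can be treated as a sentence and passed to a language-modelling technique to obtain node embeddings.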

SP | Jan 15, 2020
Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling

Gilles Vandewiele, Isabelle Dehaene, György Kovács et al.

Information extracted from electrohysterography recordings could potentially prove to be an interesting additional source of information to estimate the risk of preterm birth. Recently, a large number of studies have reported near-perfect results in distinguishing between recordings of patients who will deliver at term or preterm using a public resource, called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this causes the results to be biased using two artificial datasets and reproduce results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies of related studies, to provide a realistic view of these methodologies' generalization capabilities. We make our research reproducible by providing all the code under an open license.
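
The flaw described above can be made concrete with a small sketch. It is not the authors' code: it substitutes plain random over-sampling (duplicating minority records) for whatever over-sampling technique the reviewed studies used, and the records, class sizes, and helper names are hypothetical. It only counts how many test records also occur in the training set under each ordering.

```python
import random

def oversample(samples, target_size, rng):
    """Random over-sampling: duplicate existing samples until target_size is reached."""
    if not samples or target_size <= len(samples):
        return samples[:]
    extra = [rng.choice(samples) for _ in range(target_size - len(samples))]
    return samples + extra

def split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked(train, test):
    """Number of test samples that also appear in the training set."""
    train_set = set(train)
    return sum(1 for s in test if s in train_set)

rng = random.Random(42)
minority = [("preterm", i) for i in range(20)]   # hypothetical minority-class records
majority = [("term", i) for i in range(200)]     # hypothetical majority-class records

# Flawed order: over-sample the minority class first, then split.
flawed_train, flawed_test = split(oversample(minority, 200, rng) + majority, 0.25, rng)

# Correct order: split first, then over-sample only the training minority.
train, test = split(minority + majority, 0.25, rng)
train_min = [s for s in train if s[0] == "preterm"]
train_maj = [s for s in train if s[0] == "term"]
correct_train = oversample(train_min, len(train_maj), rng) + train_maj

print("leaked test samples (flawed order): ", leaked(flawed_train, flawed_test))
print("leaked test samples (correct order):", leaked(correct_train, test))
```

With the flawed order, copies of the same minority record land on both sides of the split, so a classifier is partly evaluated on data it was trained on. Interpolating over-samplers do not create exact duplicates, but their synthetic samples are still derived from records that may end up in the test set.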

NE | Sep 13, 2019
GENDIS: GENetic DIscovery of Shapelets

Gilles Vandewiele, Femke Ongenae, Filip De Turck

In the time series classification domain, shapelets are small time series that are discriminative for a certain class. It has been shown that classifiers are able to achieve state-of-the-art results on a plethora of datasets by taking as input distances from the input time series to different discriminative shapelets. Additionally, these shapelets can easily be visualized and are thus interpretable, making them very appealing in critical domains, such as the health care domain, where longitudinal data is ubiquitous. In this study, a new paradigm for shapelet discovery is proposed, which is based upon evolutionary computation. The advantages of the proposed approach are that (i) it is gradient-free, which could make it easier to escape local optima and find suitable candidates, and it supports non-differentiable objectives, (ii) no brute-force search is required, which reduces the computational complexity by several orders of magnitude, (iii) the total number of shapelets and the length of each of these shapelets are evolved jointly with the shapelets themselves, alleviating the need to specify this beforehand, (iv) entire sets are evaluated at once as opposed to single shapelets, which results in smaller final sets of less similar shapelets with similar predictive performance, and (v) discovered shapelets do not need to be a subsequence of the input time series. We present the results of experiments which validate the enumerated advantages.
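
The "distances from the input time series to different discriminative shapelets" mentioned above are commonly computed as the minimum distance between a shapelet and every same-length window of the series. The sketch below assumes Euclidean distance without normalization, which the abstract does not specify; the function names and toy data are hypothetical, and the genetic search itself is not shown.

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between `shapelet` and any same-length window of `series`."""
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best

def to_features(series, shapelets):
    """Turn one time series into a feature vector of distances, one per shapelet."""
    return [shapelet_distance(series, s) for s in shapelets]

# Toy data (hypothetical): one series with a bump, one flat series.
series_a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
series_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shapelets = [[1.0, 2.0, 1.0], [0.0, 0.0]]

features_a = to_features(series_a, shapelets)
features_b = to_features(series_b, shapelets)
print(features_a)
print(features_b)
```

The resulting feature vectors can be fed to any standard classifier. The bump shapelet separates the two toy series, while the flat shapelet matches both and carries no information.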

LG | Dec 3, 2016
Positive blood culture detection in time series data using a BiLSTM network

Leen De Baets, Joeri Ruyssinck, Thomas Peiffer et al.

The presence of bacteria or fungi in the bloodstream of patients is abnormal and can lead to life-threatening conditions. A computational model based on a bidirectional long short-term memory artificial neural network is explored to assist doctors in the intensive care unit in predicting whether examination of blood cultures of patients will return positive. As input it uses nine monitored clinical parameters, presented as time series data, collected from 2177 ICU admissions at the Ghent University Hospital. Our main goal is to determine if general machine learning methods and, more specifically, temporal models can be used to create an early detection system. This preliminary research obtains an area of 71.95% under the precision-recall curve, indicating the potential of temporal neural networks in this context.

ML | Nov 17, 2016
GENESIM: genetic extraction of a single, interpretable model

Gilles Vandewiele, Olivier Janssens, Femke Ongenae et al.

Models obtained by decision tree induction techniques excel in being interpretable. However, they can be prone to overfitting, which results in a low predictive performance. Ensemble techniques are able to achieve a higher accuracy. However, this comes at the cost of losing interpretability of the resulting model. This makes ensemble techniques impractical in applications where decision support, instead of decision making, is crucial. To bridge this gap, we present the GENESIM algorithm, which uses a genetic algorithm to transform an ensemble of decision trees into a single decision tree with an enhanced predictive performance. We compared GENESIM to prevalent decision tree induction and ensemble techniques using twelve publicly available data sets. The results show that GENESIM achieves a better predictive performance on most of these data sets than decision tree induction techniques and a predictive performance in the same order of magnitude as the ensemble techniques. Moreover, the resulting model of GENESIM has a very low complexity, making it very interpretable, in contrast to ensemble techniques.

AI | Nov 15, 2016
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

Haoran Tang, Rein Houthooft, Davis Foote et al.

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows their occurrences to be counted with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.
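
The hash-then-count idea can be sketched in a few lines. The abstract does not name the hash function or the exact bonus formula, so the sketch below makes two assumptions: a SimHash-style code (signs of a random Gaussian projection) and a bonus of `beta / sqrt(count)`. The class name `HashCountBonus` and all parameter values are hypothetical.

```python
import math
import random
from collections import defaultdict

class HashCountBonus:
    """Count state visits via SimHash-style codes and return a bonus of beta / sqrt(count)."""

    def __init__(self, state_dim, n_bits=8, beta=1.0, seed=0):
        rng = random.Random(seed)
        # Random Gaussian projection; each row yields one bit of the hash code.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(n_bits)
        ]
        self.beta = beta
        self.counts = defaultdict(int)  # hash table: code -> visit count

    def hash_code(self, state):
        """Map a continuous state to a tuple of bits (sign of each projection)."""
        return tuple(
            1 if sum(a * x for a, x in zip(row, state)) >= 0 else 0
            for row in self.projection
        )

    def bonus(self, state):
        """Record one visit to the state's hash bucket and return the exploration bonus."""
        code = self.hash_code(state)
        self.counts[code] += 1
        return self.beta / math.sqrt(self.counts[code])

explorer = HashCountBonus(state_dim=3, n_bits=8, beta=0.5)
state = [0.2, -1.0, 0.7]
bonuses = [explorer.bonus(state) for _ in range(4)]
print(bonuses)
```

The bonus would be added to the environment reward before the RL update. The number of bits controls granularity: fewer bits merge more states into one bucket, more bits make most buckets contain a single state, which matches the abstract's remark about appropriate granularity.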

LG | May 31, 2016
VIME: Variational Information Maximizing Exploration

Rein Houthooft, Xi Chen, Yan Duan et al.

Scalable and effective exploration remains a key challenge in reinforcement learning (RL). While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios. As such, most contemporary RL relies on simple heuristics such as epsilon-greedy exploration or adding Gaussian noise to the controls. This paper introduces Variational Information Maximizing Exploration (VIME), an exploration strategy based on maximization of information gain about the agent's belief of environment dynamics. We propose a practical implementation, using variational inference in Bayesian neural networks, which efficiently handles continuous state and action spaces. VIME modifies the MDP reward function, and can be applied with several different underlying RL algorithms. We demonstrate that VIME achieves significantly better performance compared to heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.

ML | Aug 3, 2015
Integrated Inference and Learning of Neural Factors in Structural Support Vector Machines

Rein Houthooft, Filip De Turck

Tackling pattern recognition problems in areas such as computer vision, bioinformatics, speech or text recognition is often done best by taking into account task-specific statistical relations between output variables. In structured prediction, this internal structure is used to predict multiple outputs simultaneously, leading to more accurate and coherent predictions. Structural support vector machines (SSVMs) are nonprobabilistic models that optimize a joint input-output function through margin-based learning. Because SSVMs generally disregard the interplay between unary and interaction factors during the training phase, final parameters are suboptimal. Moreover, their factors are often restricted to linear combinations of input features, limiting their generalization power. To improve prediction accuracy, this paper proposes: (i) joint inference and learning by integration of back-propagation and loss-augmented inference in SSVM subgradient descent; (ii) extending SSVM factors to neural networks that form highly nonlinear functions of input features. Image segmentation benchmark results demonstrate improvements over conventional SSVM training methods in terms of accuracy, highlighting the feasibility of end-to-end SSVM training with neural factors.