Karel Hynek

LG
h-index13
7papers
114citations
Novelty36%
AI Score31

7 Papers

NIOct 9, 2023
NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification

Josef Koumar, Karel Hynek, Jaroslav Pešek et al.

Network traffic monitoring based on IP Flows is a standard monitoring approach that can be deployed to various network infrastructures, even the large IPS-based networks connecting millions of people. Since flow records traditionally contain only limited information (addresses, transport ports, and amount of exchanged data), they are also commonly extended for additional features that enable network traffic analysis with high accuracy. Nevertheless, the flow extensions are often too large or hard to compute, which limits their deployment only to smaller-sized networks. This paper proposes a novel extended IP flow called NetTiSA (Network Time Series Analysed), which is based on the analysis of the time series of packet sizes. By thoroughly testing 25 different network classification tasks, we show the broad applicability and high usability of NetTiSA, which often outperforms the best-performing related works. For practical deployment, we also consider the sizes of flows extended for NetTiSA and evaluate the performance impacts of its computation in the flow exporter. The novel feature set proved universal and deployable to high-speed ISP networks with 100\,Gbps lines; thus, it enables accurate and widespread network security protection.

LGJul 25, 2023
Network Traffic Classification based on Single Flow Time Series Analysis

Josef Koumar, Karel Hynek, Tomáš Čejka

Network traffic monitoring using IP flows is used to handle the current challenge of analyzing encrypted network communication. Nevertheless, the packet aggregation into flow records naturally causes information loss; therefore, this paper proposes a novel flow extension for traffic features based on the time series analysis of the Single Flow Time series, i.e., a time series created by the number of bytes in each packet and its timestamp. We propose 69 universal features based on the statistical analysis of data points, time domain analysis, packet distribution within the flow timespan, time series behavior, and frequency domain analysis. We have demonstrated the usability and universality of the proposed feature vector for various network traffic classification tasks using 15 well-known publicly available datasets. Our evaluation shows that the novel feature vector achieves classification performance similar or better than related works on both binary and multiclass classification tasks. In more than half of the evaluated tasks, the classification performance increased by up to 5\%.

LGSep 27, 2024
CESNET-TimeSeries24: Time Series Dataset for Network Traffic Anomaly Detection and Forecasting

Josef Koumar, Karel Hynek, Tomáš Čejka et al.

Anomaly detection in network traffic is crucial for maintaining the security of computer networks and identifying malicious activities. One of the primary approaches to anomaly detection are methods based on forecasting. Nevertheless, extensive real-world network datasets for forecasting and anomaly detection techniques are missing, potentially causing performance overestimation of anomaly detection algorithms. This manuscript addresses this gap by introducing a dataset comprising time series data of network entities' behavior, collected from the CESNET3 network. The dataset was created from 40 weeks of network traffic of 275 thousand active IP addresses. The ISP origin of the presented data ensures a high level of variability among network entities, which forms a unique and authentic challenge for forecasting and anomaly detection models. It provides valuable insights into the practical deployment of forecast-based anomaly detection approaches.

LGOct 30, 2023
DataZoo: Streamlining Traffic Classification Experiments

Jan Luxemburk, Karel Hynek

The machine learning communities, such as those around computer vision or natural language processing, have developed numerous supportive tools and benchmark datasets to accelerate the development. In contrast, the network traffic classification field lacks standard benchmark datasets for most tasks, and the available supportive software is rather limited in scope. This paper aims to address the gap and introduces DataZoo, a toolset designed to streamline dataset management in network traffic classification and to reduce the space for potential mistakes in the evaluation setup. DataZoo provides a standardized API for accessing three extensive datasets -- CESNET-QUIC22, CESNET-TLS22, and CESNET-TLS-Year22. Moreover, it includes methods for feature scaling and realistic dataset partitioning, taking into consideration temporal and service-related factors. The DataZoo toolset simplifies the creation of realistic evaluation scenarios, making it easier to cross-compare classification methods and reproduce results.

LGFeb 18, 2025
Universal Embedding Function for Traffic Classification via QUIC Domain Recognition Pretraining: A Transfer Learning Success

Jan Luxemburk, Karel Hynek, Richard Plný et al.

Encrypted traffic classification (TC) methods must adapt to new protocols and extensions as well as to advancements in other machine learning fields. In this paper, we follow a transfer learning setup best known from computer vision. We first pretrain an embedding model on a complex task with a large number of classes and then transfer it to five well-known TC datasets. The pretraining task is recognition of SNI domains in encrypted QUIC traffic, which in itself is a problem for network monitoring due to the growing adoption of TLS Encrypted Client Hello. Our training pipeline -- featuring a disjoint class setup, ArcFace loss function, and a modern deep learning architecture -- aims to produce universal embeddings applicable across tasks. The proposed solution, based on nearest neighbors search in the embedding space, surpasses SOTA performance on four of the five TC datasets. A comparison with a baseline method utilizing raw packet sequences revealed unexpected findings with potential implications for the broader TC field. We published the model architecture, trained weights, and transfer learning experiments.

LGJun 10, 2025
When Simple Model Just Works: Is Network Traffic Classification in Crisis?

Kamil Jerabek, Jan Luxemburk, Richard Plny et al.

Machine learning has been applied to network traffic classification (TC) for over two decades. While early efforts used shallow models, the latter 2010s saw a shift toward complex neural networks, often reporting near-perfect accuracy. However, it was recently revealed that a simple k-NN baseline using packet sequences metadata (sizes, times, and directions) can be on par or even outperform more complex methods. In this paper, we investigate this phenomenon further and evaluate this baseline across 12 datasets and 15 TC tasks, and investigate why it performs so well. Our analysis shows that most datasets contain over 50% redundant samples (identical packet sequences), which frequently appear in both training and test sets due to common splitting practices. This redundancy can lead to overestimated model performance and reduce the theoretical maximum accuracy when identical flows have conflicting labels. Given its distinct characteristics, we further argue that standard machine learning practices adapted from domains like NLP or computer vision may be ill-suited for TC. Finally, we propose new directions for task formulation and evaluation to address these challenges and help realign the field.

CRJul 9, 2021
Large Scale Measurement on the Adoption of Encrypted DNS

Sebastián García, Karel Hynek, Dmtrii Vekshin et al.

Several encryption proposals for DNS have been presented since 2016, but their adoption was not comprehensively studied yet. This research measured the current adoption of DoH (DNS over HTTPS), DoT (DNS over TLS), and DoQ (DNS over QUIC) for five months at the beginning of 2021 by three different organizations with global coverage. By comparing the total values, amount of requests per user, and the seasonality of the traffic, it was possible to obtain the current adoption trends. Moreover, we actively scanned the Internet for still-unknown working DoH servers and we compared them with a novel curated list of well-known DoH servers. We conclude that despite growing in 2020, during the first five months of 2021 there was statistically significant evidence that the average amount of Internet traffic for DoH, DoT and DoQ remained stationary. However, we found that the amount of, still unknown and ready to use, DoH servers grew 4 times. These measurements suggest that even though the amount of encrypted DNS is currently not growing, there may probably be more connections soon to those unknown DoH servers for benign and malicious purposes.