CR LG NI Mar 26, 2024

Fingerprinting web servers through Transformer-encoded HTTP response headers

arXiv:2404.00056v1, 1 citation, h-index: 1, Has Code
Originality Synthesis-oriented
AI Analysis

The approach offers cybersecurity professionals a more accurate and flexible alternative to rule-based systems, though the contribution is incremental: it applies existing deep learning methods to a specific domain.

The paper tackles the problem of detecting vulnerable web server versions by using Transformer-encoded HTTP response headers. It achieves a macro F1-score of 0.96 when classifying the five most popular web servers and a weighted F1-score of 0.55 when classifying 347 major type and minor version pairs.
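The two reported scores use different averaging schemes, so they are not directly comparable. The sketch below, using only the Python standard library, shows how macro and weighted F1 differ on a toy set of labels. The label values and predictions are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true samples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data (invented): four nginx samples, two apache samples.
y_true = ["nginx", "nginx", "nginx", "nginx", "apache", "apache"]
y_pred = ["nginx", "nginx", "nginx", "apache", "apache", "apache"]

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
print(round(weighted_f1(y_true, y_pred), 4))
```

Macro averaging treats rare and common classes alike, while weighted averaging lets frequent classes dominate, which matters when class sizes are very uneven.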

We explored leveraging state-of-the-art deep learning, big data, and natural language processing to enhance the detection of vulnerable web server versions. Focusing on improving accuracy and specificity over rule-based systems, we conducted experiments by sending various ambiguous and non-standard HTTP requests to 4.77 million domains and capturing HTTP response status lines. We represented these status lines by training a BPE tokenizer and RoBERTa encoder for unsupervised masked language modeling. We then reduced the dimensionality of the encoded response lines and concatenated them to represent each domain's web server. A Random Forest and multilayer perceptron (MLP) classified these web servers and achieved 0.94 and 0.96 macro F1-score, respectively, on detecting the five most popular origin web servers. The MLP achieved a weighted F1-score of 0.55 on classifying 347 major type and minor version pairs. Analysis indicates that our test cases are meaningful discriminants of web server types. Our approach demonstrates promise as a powerful and flexible alternative to rule-based systems.
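The data collection step described in the abstract (sending unusual requests and keeping only the status line of each response) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the probe set in `PROBES`, the function names, and the timeout and port defaults are all hypothetical, and the abstract does not list the paper's actual test cases.

```python
import socket

# Hypothetical probes (assumption): each is (method, target, version).
# The paper's real test cases are not given in the abstract.
PROBES = {
    "baseline": ("GET", "/", "HTTP/1.1"),
    "unknown_method": ("FOO", "/", "HTTP/1.1"),
    "bad_version": ("GET", "/", "HTTP/9.9"),
    "lowercase_method": ("get", "/", "HTTP/1.1"),
}


def build_request(host, method, target, version):
    """Build a raw request by hand so malformed parts are sent unmodified."""
    text = f"{method} {target} {version}\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    return text.encode("ascii")


def parse_status_line(raw):
    """Return the first line of a raw response, or '' if nothing was received."""
    lines = raw.splitlines()
    return lines[0].decode("latin-1").strip() if lines else ""


def fetch_status_line(host, request, port=80, timeout=5.0):
    """Send one raw request over plain TCP and read until the first line ends."""
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        while b"\n" not in data and len(data) < 4096:
            chunk = sock.recv(1024)
            if not chunk:
                break
            data += chunk
    return parse_status_line(data)


def probe_domain(host):
    """Collect one status line per probe; network failures become ''."""
    results = {}
    for name, (method, target, version) in PROBES.items():
        try:
            request = build_request(host, method, target, version)
            results[name] = fetch_status_line(host, request)
        except OSError:
            results[name] = ""
    return results
```

Calling `probe_domain("example.com")` would return a dictionary with one status line per probe name. Each of those strings is what a tokenizer and encoder would then consume. Only probe hosts you are authorized to test.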

Code Implementations: 1 repo
