CR | CL | LG | Apr 24, 2023

ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain

arXiv:2304.11960v4 | 7 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the time-consuming task of manually scanning online sources for cyber threat intelligence. The advance is incremental: it builds on existing NLP extractors and shifts the focus to the search for documents.

The paper tackles the problem of automating the search for cybersecurity documents by proposing ThreatCrawl, a BERT-based focused crawler that dynamically adapts its crawling path. It achieves harvest rates of up to 52%, which the authors report, to the best of their knowledge, as better than the current state of the art.

Publicly available information contains valuable information for Cyber Threat Intelligence (CTI). This can be used to prevent attacks that have already taken place on other systems. Ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards to exchange this information, a lot of it is shared in articles or blog posts in non-standardized ways. Manually scanning through multiple online portals and news pages to discover new threats and extracting them is a time-consuming task. To automate parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this already solves the problem of extracting the information out of documents, the search for these documents is rarely considered. In this paper, a new focused crawler is proposed called ThreatCrawl, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulty classifying the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art. The results and source code will be made publicly available upon acceptance.
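
To make the idea of a classifier-driven crawl path and the harvest-rate metric concrete, here is a minimal sketch and not ThreatCrawl's actual implementation. Everything in it is an assumption for illustration: the in-memory `PAGES` graph replaces real HTTP fetching, the keyword-based `relevance` function is a stand-in for a BERT classifier, and `focused_crawl`, `budget`, and `threshold` are hypothetical names. Harvest rate is computed as it is commonly defined in the focused-crawling literature: relevant pages divided by pages crawled.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch and parse pages over HTTP instead.
PAGES = {
    "seed": ("security news portal", ["a", "b"]),
    "a": ("new malware campaign uses phishing exploit", ["c", "d"]),
    "b": ("recipe for apple pie", ["e"]),
    "c": ("ransomware exploit indicators of compromise", []),
    "d": ("malware vulnerability advisory", []),
    "e": ("gardening tips for spring", []),
}

# Assumed keyword list; stands in for a trained BERT-based classifier.
KEYWORDS = {"malware", "exploit", "ransomware", "phishing", "vulnerability"}


def relevance(text):
    """Toy relevance score: fraction of words that are security keywords."""
    words = text.lower().split()
    return sum(w in KEYWORDS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, budget, threshold=0.2):
    """Crawl up to `budget` pages, preferring links found on relevant pages."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, seed)]
    seen = {seed}
    visited, relevant = [], []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        score = relevance(text)
        visited.append(url)
        if score >= threshold:
            relevant.append(url)
        # Outgoing links inherit the score of the page they were found on,
        # so the crawl path adapts toward relevant regions.
        for link in links:
            if link not in seen and link in PAGES:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    harvest_rate = len(relevant) / len(visited) if visited else 0.0
    return visited, relevant, harvest_rate


if __name__ == "__main__":
    visited, relevant, rate = focused_crawl("seed", budget=4)
    print(visited, relevant, rate)
```

With a budget of four pages, the sketch follows the links found on the relevant page `a` before the off-topic page `b`, which is why the harvest rate stays high for small budgets and drops once only irrelevant pages remain in the frontier.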
