IR, CL | Aug 22, 2013

Automatic Labeling for Entity Extraction in Cyber Security

arXiv:1308.4941v3 | 105 citations
Originality: Incremental advance
AI Analysis

This work addresses the costly and often infeasible manual annotation of cybersecurity text by providing an automated labeling method.

The paper tackles the scarcity of training data for entity extraction in cybersecurity by developing an automatic labeling method that uses domain-specific structured data. A model trained on about 750,000 words of the resulting corpus achieves near-perfect precision, recall, and accuracy, with training times under 17 seconds.

Timely analysis of cyber-security information necessitates automated information extraction from unstructured text. While state-of-the-art extraction methods produce extremely accurate results, they require ample training data, which is generally unavailable for specialized applications, such as detecting security related entities; moreover, manual annotation of corpora is very costly and often not a viable solution. In response, we develop a very precise method to automatically label text from several data sources by leveraging related, domain-specific, structured data and provide public access to a corpus annotated with cyber-security entities. Next, we implement a Maximum Entropy Model trained with the average perceptron on a portion of our corpus ($\sim$750,000 words) and achieve near perfect precision, recall, and accuracy, with training times under 17 seconds.
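The automatic labeling idea in the abstract can be illustrated with a minimal sketch. It assumes the structured data reduces to a lookup from known entity strings to labels, and it uses greedy longest-match with BIO tags. The dictionary entries, the label names, the BIO scheme, and the function name `auto_label` are all hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: surface string (lowercased) -> entity label.
# In the paper's setting this would be derived from domain-specific
# structured data; these entries and label names are invented.
KNOWN_ENTITIES: Dict[str, str] = {
    "adobe": "VENDOR",
    "acrobat reader": "PRODUCT",
    "9.3": "VERSION",
}


def auto_label(tokens: List[str], known: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    # Longest known phrase, measured in whitespace-separated tokens.
    max_len = max((len(key.split()) for key in known), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tok.lower() for tok in tokens[i:i + n])
            if phrase in known:
                tag = known[phrase]
                labels[i] = "B-" + tag
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + tag
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return list(zip(tokens, labels))


if __name__ == "__main__":
    sentence = "Adobe Acrobat Reader 9.3 allows remote attackers".split()
    for token, label in auto_label(sentence, KNOWN_ENTITIES):
        print(token, label)
```

Token/label pairs produced this way could then serve as training data for a sequence tagger such as the averaged-perceptron model the abstract mentions.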

Code Implementations: 3 repos
