CL Jul 18, 2025

Label Unification for Cross-Dataset Generalization in Cybersecurity NER

Maciej Jalocha, Johan Hausted Schmidt, William Michelseen

arXiv:2507.13870v22 citations h-index: 1

Originality: Synthesis-oriented

AI Analysis

This paper addresses dataset interoperability issues for cybersecurity researchers, but its results are incremental, with no significant performance gains.

The paper tackled the problem of non-standardized labels in cybersecurity NER by investigating label unification across four datasets, finding that models trained on unified data generalized poorly with only marginal improvements from proposed architectures.

The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared to BERT-base-NER.
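
The coarse-grained label unification mentioned in the abstract can be pictured as a lookup from each dataset's own entity types to a shared label set. The sketch below is a minimal illustration under stated assumptions: the entity type names, the `UNIFIED_LABELS` table, and the helper functions are hypothetical and are not taken from the paper, and collapsing unmapped types to `O` is one possible design choice rather than the authors' documented procedure.

```python
# Hypothetical mapping from dataset-specific entity types to coarse unified types.
# All names here are illustrative assumptions, not the paper's actual mapping.
UNIFIED_LABELS = {
    "Malware": "MALWARE",
    "MAL": "MALWARE",
    "ThreatActor": "ORG",
    "Organization": "ORG",
    "SW": "SOFTWARE",
    "Application": "SOFTWARE",
}


def unify_tag(tag, mapping=UNIFIED_LABELS):
    """Map one BIO tag to the unified scheme; unmapped entity types become 'O'."""
    if tag == "O":
        return "O"
    # Split "B-MAL" into the BIO prefix ("B") and the entity type ("MAL").
    prefix, _, entity_type = tag.partition("-")
    unified = mapping.get(entity_type)
    return f"{prefix}-{unified}" if unified else "O"


def unify_sequence(tags, mapping=UNIFIED_LABELS):
    """Apply the unified mapping to a whole sentence of BIO tags."""
    return [unify_tag(tag, mapping) for tag in tags]
```

With tags relabeled this way, a model trained on one dataset can be scored on another using the same label vocabulary, which is the precondition for the pairwise cross-dataset evaluation the abstract describes.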
