Inshirah Idris

2 Papers

CL · Jan 25
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett et al.

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

CR · Oct 16, 2021
An Effective Attack Scenario Construction Model based on Attack Steps and Stages Identification

Taqwa Ahmed Alhaj, Maheyzah Md Siraj, Anazida Zainal et al.

A Network Intrusion Detection System (NIDS) is a network security technology for detecting intruder attacks. However, it produces a large number of low-level alerts, which makes analysis difficult, especially when constructing attack scenarios. Attack scenario construction (ASC) via Alert Correlation (AC) is important for revealing the attack strategy in terms of the steps and stages that must be launched for the attack to succeed. Most existing works correlate alerts by classifying them according to cause-effect relationships. However, these works suffer from false and incomplete correlations due to the infiltration of raw alerts. To address this problem, this work proposes an effective ASC model to discover the complete relationships among alerts. The model is evaluated experimentally on two datasets, DARPA 2000 and ISCX2012. The Completeness and Soundness of the proposed model are measured to evaluate the overall correlation effectiveness.