LSPR23: A novel IDS dataset from the largest live-fire cybersecurity exercise
Melella C.;
2024-01-01
Abstract
Cybersecurity threats are constantly evolving and becoming increasingly sophisticated, automated, adaptive, and intelligent, which makes it difficult for organizations to defend their digital assets. Industry professionals are adopting different strategies to improve the efficiency and effectiveness of cybersecurity operations. In this context, developing new intrusion detection systems (IDSs) to address these threats has become important. Most of these systems today are based on machine learning, but they need high-quality data to "learn" the characteristics of malicious traffic. Such datasets are difficult to obtain and therefore rarely available. This paper advances the state of the art and presents a new high-quality IDS dataset. The dataset originates from Locked Shields, one of the world's most extensive live-fire cyber defense exercises. This ensures that (i) it contains realistic behavior of attackers and defenders; (ii) it contains sophisticated attacks; and (iii) it contains labels, as the actions of the attackers are well-documented. The dataset includes approximately 16 million network flows, of which approximately 1.6 million were labeled malicious. What is unique about this dataset is the use of a new labeling technique that increases the accuracy of data labeling. We evaluate the robustness of our dataset using both quantitative and qualitative methodologies. We begin with a quantitative examination of the Suricata IDS alerts based on signatures and anomalies. Subsequently, we assess the reproducibility of machine learning experiments conducted by Känzig et al., who used a private Locked Shields dataset. We also apply the quality criteria outlined by the evaluation framework proposed by Gharib et al.
Using our dataset with an existing classifier, we demonstrate comparable results (F1 score of 0.997) to the original paper, where the classifier was evaluated on a private dataset (F1 score of 0.984).
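For readers unfamiliar with the metric reported above, the F1 score is the harmonic mean of precision and recall. With roughly one in ten flows labeled malicious, a classifier that marks every flow as benign would still reach about 90% accuracy, which is one reason F1 is commonly preferred for this kind of data. The following is a minimal sketch of the calculation only; the function name `f1_score` and the counts are hypothetical and are not taken from the paper or from its classifier.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall).

    tp: malicious flows correctly flagged (true positives)
    fp: benign flows wrongly flagged (false positives)
    fn: malicious flows that were missed (false negatives)
    """
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen for illustration; NOT results from the paper.
tp, fp, fn = 990, 5, 10
print(f"F1 = {f1_score(tp, fp, fn):.3f}")
```

The same value can be written as 2·TP / (2·TP + FP + FN), which shows that true negatives (correctly ignored benign flows) do not enter the score at all.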