Abstract:
Research on emerging machine learning (ML) techniques for detecting network threats, such as malware-related threats, requires large and varied amounts of data. The research community has been using a number of network traffic datasets that have been proposed in recent years. Most of these datasets, however, contain only a few classes of bots and malware, and they lack the diversity needed for threat detection models to generalize. In this work, we consider VHS-22, a heterogeneous dataset of 27.7 million records. This dataset contains flow parameters extracted using a software network probe from four datasets and a network traffic malware monitoring website. Our methodology evaluates different machine learning techniques and ensemble classifiers. More than 99% of the malware-related threats are successfully identified by classifiers including Bagging Decision Tree, Random Forest, Extremely Randomized Trees, Decision Tree, and Histogram-Based Gradient Boosting. Additionally, we constructed a prototype dataset named MiniVHS-22 from the original VHS-22 dataset to reduce the computational burden of model training and evaluation for future researchers. We calculated the ratio of normal to attack data in the original dataset and maintained the same ratio in the 1M-record MiniVHS-22 dataset. We then applied dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) with varying numbers of components, and we explain our analysis in the results section. The results of our investigation can support the development of sophisticated network traffic threat detection systems.
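The ratio-preserving construction of a smaller dataset described in the abstract can be illustrated with a minimal sketch. This is not the thesis authors' procedure: the function name `stratified_subsample`, the toy labels, and the 80/20 class split are all hypothetical and chosen only for illustration. The sketch draws a per-class quota proportional to each class's share of the full data.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, target_size, seed=0):
    """Return sorted indices of a subsample whose class shares
    approximately match those of the full label list.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group record indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    chosen = []
    for label, indices in by_class.items():
        # Quota proportional to the class frequency, rounded to an integer
        # and capped at the number of records available in that class.
        quota = min(round(target_size * len(indices) / total), len(indices))
        chosen.extend(rng.sample(indices, quota))
    return sorted(chosen)


# Toy labels standing in for flow records; the 80/20 split is made up.
labels = ["normal"] * 8000 + ["attack"] * 2000
subset = stratified_subsample(labels, target_size=1000)
attack_share = sum(labels[i] == "attack" for i in subset) / len(subset)
```

Because each quota is rounded independently, the subsample size may differ from `target_size` by a few records when class proportions do not divide evenly. A dimensionality reduction step such as PCA or LDA would then be applied to the features of the selected records.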
Description:
This thesis was submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Information and Communication Engineering of East West University, Dhaka, Bangladesh.