Attack Analysis and Prediction using Machine Learning

Utkarsha Bamane; Sumitra Pundlik

doi:10.17577/IJERTV11IS090112

Volume 11, Issue 09 (September 2022)

Open Access
Paper ID : IJERTV11IS090112
Published (First Online): 07-10-2022
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Utkarsha Bamane 1, Sumitra Pundlik 2

1 MTech Student, Information and Technology (Cyber Security), MIT ADT University, Pune, Maharashtra, India

2 Professor, Dept. of Information and Technology, MIT ADT University, Pune, Maharashtra, India

Abstract: An intrusion is the intentional breach of a security policy, and intrusion detection is the process of identifying such breaches. To look for malicious activity or threats, intrusion detection systems (IDS) monitor network traffic passing across various types of computer systems and raise warnings when they detect any danger. Threat detection systems should be able to identify every harmful program and event in the network. All forms of attack, including intrusion, file-less malware, botnets, and malware, are changing the threat environment. To identify harmful events by examining a program's behavioral pattern, a learning detection system is necessary. In this context, we have built a framework that specifies the type of attack recognized by machine learning. Malicious activity detection can be divided into two categories: signature-based (misuse) detection and anomaly-based detection. For both types of detection, an IDS must gather the necessary data, evaluate it, and then compare it to attack signatures kept in large databases or to profiles of legitimate behavior.

In our study, we propose a method for building an effective IDS using either the stacking algorithm or the decision tree algorithm. According to the results, the proposed method is more accurate and efficient than other methods such as logistic regression and random forest. The proposed method achieves an accuracy rate of 99.36%. The Attack Analyzer system uses four different algorithms to check multiple types of protocol parameters and authenticates users. It then applies stacking with and without feature selection to assess accuracy and choose the best algorithm for identifying the class of traffic: benign, DoS, port scan, brute force, web attack, bot, or infiltration.

KeyWords: Threat detection, Machine Learning, IDS, malicious, datasets, Intrusion, classifiers, regression.

INTRODUCTION

An intrusion detection system (IDS) is used to find threats or malicious activity. To protect a computer network, the IDS takes network-level defensive action. A threat or intrusion typically manifests itself as an anomaly in the network. The protection of a network is violated when an intruder takes advantage of system defects such as lax security rules and software issues like buffer overflows, or launches DoS attacks that exploit network flaws. The intruders could be cybercriminals, that is, internet users who want to steal or damage highly sensitive data on the victim's system, or they could be system users with limited privileges who want to gain more access than they are allowed. Intrusion detection methods include signature-based and anomaly-based techniques. In signature-based detection, a specialized system or piece of software monitors packet flow in the network and compares it to previously discovered, configured signatures of known threats. The anomaly detection technique, in contrast, finds attacks by comparing defined legitimate user parameters with events that diverge from those parameters. Whenever malicious behavior occurs in a network, the IDS creates logs and notifies the network administrator.

Threat detection systems should be able to identify every harmful program and activity in the network. All forms of threat, including intrusion, file-less malware, botnets, and malware, are changing the threat environment. To identify harmful events by examining a program's behavioral pattern, a learning detection system is necessary. Using machine learning and deep learning approaches, we have created models to recognize malicious software and system events. Ensembling is a technique for combining the outputs of various algorithms before generating the final result.

1.1 OBJECTIVES
1. Build a threat detection system that can accurately identify all malicious programs and network events.
2. The threat environment is changing for all forms of attack, including intrusion, malware, file-less malware, and botnets. To identify malicious occurrences, it is necessary to use a learning detection system that examines a program's behavioral pattern.
3. Secure automatic threat detection and prevention scans the network and server functions and alerts the analyst if any suspicious behavior is found in the network traffic. This method reduces the burden on the analyst. It continuously monitors the system and reacts in accordance with the threat environment.
4. Our technology uses a variety of machine learning methods to identify network intrusion. The IDS keeps an eye out for malicious behavior and guards a computer network against unauthorized access from users, possibly even from insiders.
5. A threat or intrusion manifests as an anomaly in a network. Hackers violate the security of the network by exploiting vulnerabilities such as lax security rules and software problems like buffer overflows.
HISTORY & BACKGROUND

At the center of the project is a machine learning algorithm. The most pertinent items are suggested to users via a recommendation engine, which filters the data using various techniques. It records the user's preferences and inclinations and then proposes alternatives that are consistent with those preferences.

2.1 Algorithm Used

2.1.1 Extra tree Classifier:

Extremely randomised trees (Extra Trees) are an ensemble learning method that constructs a collection of decision trees. During tree construction, the decision rule (the split value) is drawn at random. Apart from this random selection of split values, the algorithm is quite similar to Random Forest.
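The difference in how the split value is chosen can be illustrated with a small sketch. It is not the implementation used in this work: the one-feature toy data, the Gini criterion, and the function names are assumptions made for illustration. A full Extra Trees implementation would typically draw one random threshold for each of several candidate features and keep the best of those.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Random Forest style: search the candidate thresholds for the lowest impurity."""
    # Skip the maximum value: splitting there leaves the right side empty.
    candidates = sorted(set(values))[:-1]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """Extra Trees style: draw the threshold uniformly between min and max."""
    return rng.uniform(min(values), max(values))

# Hypothetical one-feature toy data with two classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["Benign", "Benign", "Benign", "DoS", "DoS", "DoS"]

rng = random.Random(0)
t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print("searched threshold:", t_best)
print("random threshold:", t_rand)
```

The searched threshold separates the two toy classes perfectly, while the random threshold may or may not. Extra Trees accepts this loss of precision per split in exchange for cheaper and more diverse trees.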
PROPOSED SYSTEM ARCHITECTURE

Fig.1. Project Flow

The project flow, from data sampling through to stacking, is depicted in Fig. 1 above. It also includes the technique used for feature selection.

Data sampling provides a variety of approaches for transforming a training dataset in order to balance, or better balance, the class distribution. Once balanced, the transformed dataset can be used directly to train standard machine learning techniques. This allows the difficulty of imbalanced classification to be addressed and overcome with a data preparation strategy, even for severely imbalanced class distributions.

Data pre-processing is the process of transforming raw data into a form that can be used to train or test a machine learning model. It is the first and most important stage in developing a machine learning model. Clean, organised data is rarely available when developing a machine learning project, and any time you work with data you need to clean and prepare it. This is why we pre-process the data.
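The paper does not list the individual pre-processing steps, so the following is only a sketch of two common ones, given here as assumptions: dropping records with missing or non-finite values, and rescaling each numeric column to the range [0, 1]. The sample rows are hypothetical.

```python
import math

def clean_rows(rows):
    """Drop rows that contain missing (None) or non-finite numeric values."""
    return [r for r in rows if all(v is not None and math.isfinite(v) for v in r)]

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(r, lows, highs)]
        for r in rows
    ]

# Hypothetical raw records: two numeric features per row.
raw = [
    [10.0, 200.0],
    [20.0, float("inf")],  # non-finite value, dropped
    [None, 300.0],         # missing value, dropped
    [30.0, 400.0],
]

cleaned = clean_rows(raw)
scaled = min_max_scale(cleaned)
print(cleaned)
print(scaled)
```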

The most popular oversampling technique for addressing the imbalance issue described above is SMOTE (synthetic minority oversampling technique). It seeks to balance the class distribution by increasing the number of minority class cases. Rather than replicating existing cases, SMOTE combines existing minority instances to create new ones. For the minority class, it creates synthetic training records using linear interpolation. These synthetic records are created by randomly choosing an example from the minority class and interpolating between it and one or more of its nearest minority-class neighbors.
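The interpolation step can be sketched in a few lines. This is a simplified illustration rather than the code used in the study: the function names, the toy minority samples, and the choice of k = 2 neighbors are assumptions.

```python
import random

def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (squared Euclidean distance)."""
    def dist(q):
        return sum((a - b) ** 2 for a, b in zip(point, q))
    return sorted(others, key=dist)[:k]

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]]
new_samples = smote(minority, n_new=4)
for sample in new_samples:
    print(sample)
```

Each synthetic record lies on the line segment between two existing minority records, so the new points stay inside the region already occupied by the minority class.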

Stacking, also known as Stacked Generalization, explores a range of different models for the same problem. The idea is that a learning problem can be attacked with different types of models, each of which is able to learn only part of the problem rather than the whole problem space. You build several learners and use each trained model to produce an intermediate prediction. You then add a new model that learns the same target from these intermediate predictions. Its predicted target is then compared with the actual target.

This final model is said to be stacked on top of the others. As a result, stacking can improve overall performance and frequently yields a model that is better than each individual intermediate model. Note, however, that as is frequently the case with any machine learning technique, it does not provide any guarantees.
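A toy sketch of the idea follows. It is not the stacking configuration used in this work: the base learners here are one-feature threshold rules, the meta-learner is a simple lookup table, and the data is hypothetical. Each base learner can capture only part of the problem, because a record is an attack only when both features are high.

```python
from collections import Counter

def fit_stump(xs, ys, positive="Attack", negative="Benign"):
    """Fit a one-feature rule 'positive if x > t' by trying midpoint thresholds."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    def correct(t):
        return sum((positive if x > t else negative) == y for x, y in zip(xs, ys))
    best = max(thresholds, key=correct)
    return lambda x: positive if x > best else negative

def base_predictions(models, rows):
    """One tuple of base predictions per row; model i only sees feature i."""
    return [tuple(m(row[i]) for i, m in enumerate(models)) for row in rows]

def fit_meta(pred_tuples, ys):
    """Meta-learner: majority true label for each combination of base predictions."""
    votes = {}
    for p, y in zip(pred_tuples, ys):
        votes.setdefault(p, Counter())[y] += 1
    table = {p: c.most_common(1)[0][0] for p, c in votes.items()}
    default = Counter(ys).most_common(1)[0][0]  # fallback for unseen combinations
    return lambda p: table.get(p, default)

def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

# Hypothetical data: the label is "Attack" only when both features are high.
A, B = "Attack", "Benign"
base_X = [(1, 1), (1, 9), (9, 1), (9, 9), (2, 8), (8, 2), (8, 8), (2, 2)]
base_y = [B, B, B, A, B, B, A, B]
meta_X = [(3, 3), (3, 7), (7, 3), (7, 7)]   # held out from base training
meta_y = [B, B, B, A]
test_X = [(1, 8), (8, 1), (9, 8), (2, 1)]
test_y = [B, B, A, B]

# Level 0: one stump per feature, trained on the base split.
models = [fit_stump([r[i] for r in base_X], base_y) for i in range(2)]
# Level 1: the meta-learner is trained on base predictions for held-out rows.
meta = fit_meta(base_predictions(models, meta_X), meta_y)

base_accs = [accuracy([m(r[i]) for r in test_X], test_y) for i, m in enumerate(models)]
stacked_acc = accuracy([meta(p) for p in base_predictions(models, test_X)], test_y)
print("base accuracies:", base_accs)
print("stacked accuracy:", stacked_acc)
```

The meta-learner is trained on rows the base learners did not see, which is the usual way to stop it from learning from over-optimistic base predictions.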

A subset of pertinent features is chosen through the feature selection (or attribute selection) procedure to be used in model construction [15]. Feature selection approaches are used in machine learning (ML) to avoid the curse of dimensionality, improve generalisation by lowering variance, and reduce training time. The premise of feature selection is that data commonly contains features that are redundant or irrelevant and can be removed without significantly affecting the quality of the data.

Fig.2. Total number of records per attack

With the training dataset's full set of features, four different single classifiers are trained, and predictions are made. Table I displays the accuracy results for the various training methods. The accuracy is 99.36% for Decision Tree, 98.04% for Random Forest, 98.89% for Extra Tree Classifier, and 97.07% for XGBoost. All four algorithms are then combined using the ML stacking approach: the output of each classifier is used as input to the stacking procedure, which achieves an accuracy of 99.36%.

Table I. Results comparison without feature selection

Method	Accuracy Rate (%)
Extra Tree Classifier	98.89
Decision Tree	99.36
XGBoost	97.07
Random Forest	98.04
STACKING	99.36

The feature importances of the four classifiers are averaged to select the features. The four classifiers are then trained on the selected features and their accuracy is calculated. The accuracy of each algorithm is listed in Table II. With the selected features, the Random Forest and Extra Tree classifiers performed well.
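The averaging step can be sketched as follows. The feature names, the importance scores, and the number of features kept are hypothetical values chosen for illustration; the paper does not report them.

```python
def average_importances(importances_by_model):
    """Average each feature's importance score across all models."""
    features = next(iter(importances_by_model.values())).keys()
    n_models = len(importances_by_model)
    return {
        f: sum(scores[f] for scores in importances_by_model.values()) / n_models
        for f in features
    }

def select_top_k(avg, k):
    """Return the k feature names with the highest average importance."""
    return sorted(avg, key=avg.get, reverse=True)[:k]

# Hypothetical importance scores reported by the four classifiers.
importances = {
    "decision_tree": {"flow_duration": 0.50, "packet_count": 0.30, "dst_port": 0.15, "idle_time": 0.05},
    "random_forest": {"flow_duration": 0.40, "packet_count": 0.35, "dst_port": 0.20, "idle_time": 0.05},
    "extra_trees":   {"flow_duration": 0.35, "packet_count": 0.30, "dst_port": 0.25, "idle_time": 0.10},
    "xgboost":       {"flow_duration": 0.45, "packet_count": 0.25, "dst_port": 0.20, "idle_time": 0.10},
}

avg = average_importances(importances)
selected = select_top_k(avg, k=2)
print(avg)
print(selected)
```

Averaging across several classifiers reduces the influence of any single model's ranking on the final feature set.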

[Bar chart: Accuracy Rate (%) by method: Logistic Regression 81.57%, Decision Tree 99.31%, Random Forest 99.29%]

Table II. Results comparison with feature selection

Method	Accuracy Rate (%)
Extra Tree Classifier	98.28
Decision Tree	99.12
XGBoost	96.59
Random Forest	99.20
STACKING	98.36

Fig.3. Comparison of algorithm with & without feature selection

Fig. 3 above shows a graphical comparison of the accuracy obtained for each algorithm. It can be observed that the suggested approach was successful.

In our project, the Attack Analyzer system authenticates the user, accepts the entered data values, and uses a decision tree algorithm to identify the type of traffic: "Benign," "DoS," "PortScan," "BruteForce," "WebAttack," "Bot," or "Infiltration."
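The comparison can also be reproduced numerically from the accuracy values reported in Table I and Table II. The short script below is only a convenience sketch; the variable names are our own.

```python
# Accuracy rates (%) copied from Table I (without feature selection).
without_fs = {
    "Extra Tree Classifier": 98.89,
    "Decision Tree": 99.36,
    "XGBoost": 97.07,
    "Random Forest": 98.04,
    "Stacking": 99.36,
}
# Accuracy rates (%) copied from Table II (with feature selection).
with_fs = {
    "Extra Tree Classifier": 98.28,
    "Decision Tree": 99.12,
    "XGBoost": 96.59,
    "Random Forest": 99.20,
    "Stacking": 98.36,
}

def best_methods(scores):
    """Return every method that reaches the highest accuracy (ties included)."""
    top = max(scores.values())
    return [m for m, a in scores.items() if a == top]

# Change in accuracy (percentage points) when feature selection is applied.
change = {m: round(with_fs[m] - without_fs[m], 2) for m in without_fs}

print("best without feature selection:", best_methods(without_fs))
print("best with feature selection:", best_methods(with_fs))
for method, delta in change.items():
    print(f"{method}: {delta:+.2f}")
```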

Final output: which type of attack is detected

Attack Types

Benign

DoS

Port Scan

Brute Force

Web Attack

Bot

Infiltration
CONCLUSION

As the number of devices and large systems connected to the internet increases, security concerns have increased as well. A learning detection system is required to detect malicious events by analyzing the behavioral pattern of a program. The proposed decision tree and stacking methods performed well compared to random forest, extra tree classifier, and XGBoost when no feature selection method was applied. Our proposed method achieves an accuracy rate of 99.36%.
FUTURE SCOPE

In the future, we plan to hybridize our approach with other machine learning techniques and integrate it with different algorithms, evaluating their performance and error rate, to develop a real-time adaptive intrusion detection system that can efficiently detect attacks.

REFERENCE

[1] Intrusion Detection System (IDS): Anomaly Detection Using Outlier Detection Approach International conference on Intelligent Computing, Communication & Convergence (ICCC-2014)

[2] D. Ten, S. Manickam, S. Ramadass, and H. A. Bazar, Study on Advanced Visualization Tools In Network Monitoring Platform, in Third UKSim European Symposium on Computer Modeling and Simulation, EMS 09, Minden Penang, Malaysia, December 2009.

[3] L. Chang, W.L. Chan, J. Chang, P. Ting, M. Netrakanti, A network status monitoring system using personal computer, presented at IEEE Global Telecommunications Conference, August 2002.

[4] Intrusion Detection System (IDS): Anomaly Detection Using Outlier Detection Approach International conference on Intelligent Computing, Communication & Convergence (ICCC-2014)

[5] J. Brownlee, A Tour of Machine Learning Algorithms, https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/ 2013

[6] E. K. Viegas, A. O. Santin, and L. S. Oliveira, Toward a reliable anomaly-based intrusion detection in real-world environments, Comput. Networks, vol. 127, pp. 200–216, 2017.

[7] A. Verma and V. Ranga, Statistical analysis of CIDDS-001 dataset for Network Intrusion Detection Systems using Distance-based Machine learning, Procedia Comput. Sci., vol. 125, pp. 709–716, 2018.

[8] T. Hamed, R. Dara, and S. C. Kremer, Network intrusion detection system based on recursive feature addition and bigram technique, Comput. Secur., vol. 73, pp. 137–155, 2018.

[9] C. R. Wang, R. F. Xu, S. J. Lee, and C. H. Lee, Network intrusion detection using equality constrained-optimization-based extreme learning machines, Knowledge-Based Syst., vol. 147, pp. 68–80, 2018.

[10] G. Fernandes, L. F. Carvalho, J. J. P. C. Rodrigues, and M. L. Proença, Network anomaly detection using IP flows with Principal Component Analysis and Ant Colony Optimization, J. Netw. Comput. Appl., vol. 64, pp. 1–11, 2016.

[11] U. Ravale, N. Marathe, and P. Padiya, Feature selection based hybrid anomaly intrusion detection system using K Means and RBF kernel function, Procedia Comput. Sci., vol. 45, no. C, pp. 428–435, 2015.

[12] V. Hajisalem and S. Babaie, A hybrid intrusion detection system based on ABC-AFS algorithm for misuse and anomaly detection, Computer Networks, vol. 136, pp. 37–50, 2018.

[13] C. Khammassi and S. Krichen, A GA-LR wrapper approach for feature selection in network intrusion detection, Comput. Secur., vol. 70, pp. 255–277, 2017.

[14] M. R. Gauthama Raman, N. Somu, K. Kirthivasan, R. Liscano, and V. S. Shankar Sriram, An efficient intrusion detection system based on hypergraph – Genetic algorithm for parameter optimization and feature selection in support vector machine, Knowledge-Based Syst., vol. 134, pp. 1–12, 2017.

[15] S. Shitharth and D. Prince Winston, An enhanced optimization based algorithm for intrusion detection in SCADA network, Comput. Secur., vol. 70, pp. 16–26, 2017.

[16] S. M. Hosseini Bamakan, H. Wang, T. Yingjie, and Y. Shi, An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization, Neurocomputing, vol. 199, pp. 90–102, 2016.

[17] D. C. Le and A. N. Zincir-Heywood, Evaluating insider threat detection workflow using supervised and unsupervised learning, in IEEE Security and Privacy Workshops, 2018.

[18] P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese, Automated insider threat detection system using user and role-based profile assessment, IEEE Systems Journal, vol. 11, no. 2, 2017.

[19] A. Tuor, S. Kaplan, B. Hutchinson, N. Nichols, and S. Robinson, Deep learning for unsupervised insider threat detection in structured cybersecurity data streams, in AAAI Workshop on AI for CyberSec., 2017.

[20] B. Bose, B. Avasarala, S. Tirthapura, Y. Y. Chung, and D. Steiner, Detecting insider threats using radish: A system for real-time anomaly detection in heterogeneous data streams, IEEE Systems Journal, 2017.
