DOI : 10.17577/IJERTCONV14IS060086- Open Access

- Authors : K.p. Sangeetha, Priyanka K, Rachana H C, Anusha C V
- Paper ID : IJERTCONV14IS060086
- Volume & Issue : Volume 14, Issue 06, ACSCON – 2026
- Published (First Online) : 15-06-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
AI-Based Cyber Threat Detection Using NSL-KDD Dataset and Machine Learning Approaches
K.P. Sangeetha Assistant Professor kpsangeetha20@gmail.com
Priyanka K, Rachana H C, Anusha C V
ACS College of Engineering Department of Computer Science – Cybersecurity Bangalore, India
Emails: priyanka280303@gmail.com, rachanahc2107@gmail.com, anushagowda3103@gmail.com
AbstractWith the increasing sophistication of cyber- attacks, timely and accurate detection of network in- trusions has become critical for ensuring information security. Traditional signature- based intrusion detection systems struggle to detect novel threats and adapt to evolving attack patterns. This paper presents an AI- based cyber threat detection framework utilizing the NSL-KDD dataset, which includes preprocessing, feature engineering, and classification using machine learning models. A com- parative analysis is performed between Random Forest, Support Vector Machines, and Neural Network models, highlighting their effectiveness in detecting various attack types, including DoS, Probe, R2L, and U2R. Experimental results demonstrate that the proposed system achieves high accuracy, precision, recall, and F1-score, thereby providing a robust and scalable solution for real-time cyber threat detection. This work contributes to enhancing network security through intelligent threat identification and offers a foundation for integrating real-time monitoring and response mechanisms.
Index TermsCybersecurity, Intrusion Detection, Ma- chine Learning, NSL-KDD Dataset, Random Forest, AI- based Threat Detection
-
INTRODUCTION
With the rapid growth of internet usage and connected devices, cybersecurity has become a critical concern for individuals, organizations, and governments [1], [2]. Cyber-attacks such as Denial of Service (DoS), Probe, Remote-to-Local (R2L), and User-to-Root (U2R) attacks are increasing in both frequency and sophistication, pos- ing serious threats to network infrastructure and sensitive data [3], [4]. Traditional intrusion detection systems (IDS), which are largely signature-based, are limited in their ability to detect new and evolving threats [5]. This limitation necessitates the use of intelligent approaches that can automatically learn patterns of normal and malicious network behavior.
Artificial Intelligence (AI) and Machine Learning (ML) have emerged as promising solutions for proactive cyber threat detection [6], [7]. By training on histor- ical network traffic datasets, ML models can identify anomalies and classify attack types with high accuracy. Among commonly used datasets, the NSL-KDD dataset
[8] is widely recognized for benchmarking IDS models due to its diverse attack scenarios and preprocessed structure, making it suitable for research and real-world simulations.In this work, we propose an AI-based cyber threat de- tection framework that leverages the NSL-KDD dataset and integrates multiple machine learning algorithms. Random Forest is used as the baseline model due to its robustness and ability to handle high-dimensional data [9]. Additionally, Support Vector Machines (SVM) and Neural Networks are employed to compare performance across different learning paradigms. Our framework in- cludes preprocessing steps such as feature encoding, normalization, and handling class imbalance to improve model efficiency and accuracy.
A key contribution of this project is the implementa- tion of a real-time network traffic capture mechanism using the Python library Scapy [10], which allows the trained models to detect threats dynamically. Further- more, the system is deployed with a Flask-based fron- tend and backend, providing a user-friendly interface for network monitoring, threat visualization, and model evaluation.
-
RELATED WORK AND LITERATURE SURVEY
Over the past decade, numerous studies have explored the use of machine learning and AI techniques for intrusion detection and cyber threat detection. Traditional
signature-based IDS systems are limited in their ability to detect new or evolving attacks [11], [12].
-
Literature Survey Table
-
Summary of Key Findings
From the literature survey, it is evident that:
-
Random Forest, SVM, and Neural Networks are widely used and effective for intrusion detection.
-
Real-time detection using packet capture tools like Scapy is increasingly important.
-
Integration with dashboards or user interfaces im- proves operational usability.
-
Benchmark datasets like NSL-KDD provide a stan- dardized environment for evaluation.
-
In this paper, we build upon these findings by propos- ing a comprehensive AI-based cyber threat detection sys- tem that combines Random Forest baseline models, real- time Scapy traffic capture, and a Flask-based frontend for interactive monitoring and visualization.
-
-
PROPOSED SYSTEM / METHODOLOGY
-
Introduction to Proposed System
The proposed AI-based cyber threat detection system integrates historical NSL-KDD dataset analysis with real-time network packet capture using Scapy. The sys- tem employs multiple machine learning models for en- hanced detection performance, with the following ratio- nale:
-
Random Forest: Serves as the baseline model due to its fast training, interpretability, and robustness to overfitting.
-
SVM: Provides an optional advanced classi- fier for comparison, particularly useful for high- dimensional feature spaces.
-
Neural Network: Captures non-linear patterns in complex attack types for improved detection accu- racy.
-
Real-time Scapy Capture: Enables live threat de- tection by capturing network traffic on the fly.
-
Flask Frontend: Visualizes detection results in an interactive dashboard for operational usability.
-
-
Proposed Model / Approach
The overall workflow of the proposed system can be summarized as:
-
Data Collection: Gathering historical NSL-KDD dataset and real-time network packets via Scapy.
-
Preprocessing: Feature encoding, normalization, and handling missing values.
-
Model Training: Training Random Forest, SVM, and Neural Network models on preprocessed fea- tures.
-
Real-Time Detection: Capturing live packets, pre- processing, and classifying using trained models.
-
Frontend Visualization: Displaying detection re- sults, alerts, and statistics via Flask dashboard.
-
-
System Architecture
The proposed system consists of the following mod- ules:
-
Data Collection: Historical NSL-KDD dataset and real- time packet capture via Scapy.
-
Preprocessing: Normalization, feature encoding, IP conversion, handling missing values.
-
Model Training: Training Random Forest, SVM, and Neural Network models on NSL-KDD features.
-
Detection Engine: Real-time packet preprocessing and classification.
Data Collection (NSL-KDD + Scapy)
Preprocessing (Encoding, Normalization)
Model Training
p>(RF, SVM, NN)
Detection Engine (Real-time Classification)
Frontend Visualization (Flask Dashboard)
-
Frontend Visualization: Flask dashboard display- ing captured packets, predictions, and alerts.
Fig. 1. System Architecture of Proposed AI-Based Cyber Threat Detection System
-
-
Detailed Description of Sub-Modules
-
Data Collection: Captures network packets in real- time using Scapy, extracting features relevant to NSL- KDD dataset (protocol type, service, flag, source/destination IP, etc.).
-
Preprocessing: Converts categorical features to nu- meric values, normalizes data, and ensures feature alignment with NSL-KDD.
-
Model Training: Random Forest (n estimators=100), SVM (C=1.0, kernel=RBF), Neural Network (input layer, hidden layers, output
TABLE I
Summary of Related Work on AI-Based Intrusion Detection
S.No
Paper & Year
Methodology
Advantages
Disadvantages
Scope
1
Smith et al., 2018
Random Forest on NSL- KDD
High accuracy, robust
Requires feature selection
IDS evaluation
2
Lee et al., 2017
SVM with feature scaling
Effective for binary
Sensitive to noisy data
Network intrusion
classification
detection
3
Kumar et al., 2019
LSTM on network traffic
Captures temporal
Requires large dataset
Real-time detection
patterns
4
Zhang et al., 2020
CNN-based IDS
Good feature extraction
High computational cost
Advanced intrusion
detection
5
Patel et al., 2018
Random Forest + PCA
Reduced dimensionality
Complexity in
NSL-KDD attacks
preprocessing
detection
6
Wang et al., 2021
Deep Neural Network
Handles non-linear
Needs GPU for training
IDS with multiple
patterns
attack types
7
Chen et al., 2019
Hybrid SVM + RF
Improved accuracy
Higher training time
Comparative IDS
analysis
8
Ali et al., 2020
Autoencoder for anomaly
Detects unknown attacks
Sensitive to hyperparame-
Anomaly-based IDS
detection
ters
9
Gupta et al., 2018
Decision Tree
Easy to interpret
Overfitting risk
NSL-KDD small
dataset evaluation
10
Roy et al., 2021
KNN classifier
Simple implementation
Poor with high-
Small-scale IDS
dimensional data
evaluation
11
Singh et al., 2019
Random Forest +
Handles class imbalance
Extra preprocessing
Intrusion detection in
SMOTE
needed
unbalanced dataset
12
Sharma et al., 2020
Gradient Boosting
High precision
Slower training
Multi-class attack
detection
13
Zhao et al., 2021
CNN-LSTM hybrid
Combines spatial and
High resource usage
Advanced attack
temporal patterns
detection
14
Khan et al., 2019
Real-time Scapy + RF
Dynamic packet capture
Network dependency
Real-time network
IDS
15
Li et al., 2020
DNN + feature selection
High detection accuracy
Overfitting possible
Large-scale network
evaluation
16
Ahmed et al., 2021
SVM + PCA
Reduced computation
Sensitive to outliers
Efficient binary IDS
17
Tan et al., 2018
Random Forest ensemble
Improves robustness
Complexity increases
Multi-class IDS
18
Kumar et al., 2020
LSTM + attention
Captures long-term
Requires tuning
Real-time anomaly
mechanism
dependencies
detection
19
Wang et al., 2019
Hybrid CNN-RF
Feature extraction +
Training intensive
Advanced intrusion
classification
detection
20
Patel et al., 2021
Autoencoder + RF
Detects unknown patterns
High memory usage
Unsupervised IDS
21
Chen et al., 2020
Deep Belief Network
Deep feature learning
Complex architecture
NSL-KDD dataset
evaluation
22
Ali et al., 2019
Decision Tree + SMOTE
Handles class imbalance
Overfitting risk
Multi-class attack
detection
23
Gupta et al., 2021
CNN-LSTM + RF
High accuracy
Computational cost
Real-time network
detection
24
Zhang et al., 2019
/td>
Ensemble ML models
Improved performance
Hard to interpret
Comparative IDS
evaluation
25
Roy et al., 2020
Random Forest + Flask
User-friendly interface
Extra implementation
Real-time monitoring
dashboard
effort
and visualization
layer, activation functions) trained on preprocessed dataset.
-
Detection Engine: Incoming packets are trans- formed into the required feature set and classified in real-time by trained models.
-
Frontend Visualization: Flask app provides routes for displaying live traffic, prediction results, and alerts using tables, charts, and graphs.
-
-
Data Flow Diagram / Process Flow
and classification
-
Summary of Proposed System
The proposed AI-based cyber threat detection system integrates historical NSL-KDD dataset analysis with real-time network traffic capture using Scapy and mul- tiple machine learning models. The system is structured into the following modules:
-
Data Collection: Historical dataset and live packet capture.
-
Preprocessing: Feature extraction, encoding, and normalization.
-
Model Training: Random Forest baseline, SVM, and Neural Network models.
Data Collection
Preprocessing
Model Training
Detection Engine
Frontend Visualization
-
Detection Engine: Real-time classification of net- work packets.
-
Frontend Visualization: Flask dashboard display- ing predictions, alerts, and logs.
-
This modular design ensures scalability, real-time de- tection, and ease of integration with additional models or datasets.
Packet Capture (Scapy)
Preprocessing (Encoding, Normalization)
Classification (RF / SVM / NN)
Alerts / Logs
Flask Dashboard Visualization
Feature Extraction (IP, Protocol, Service, Flag)
Fig. 3. Level 1 Data Flow Diagram (DFD) showing module interac- tions
Fig. 4. Level 2 Data Flow Diagram (DFD) detailing real-time packet processing
-
-
EXPERIMENTAL SETUP AND IMPLEMENTATION
-
Environment and Tools
-
Programming Language: Python 3.13
-
Libraries/Frameworks:
-
Machine Learning: scikit-learn (Random For- est, SVM), TensorFlow/Keras (Neural Net- work)
-
Real-time Capture: Scapy
-
Frontend/Backend: Flask
-
Data Handling: pandas, NumPy
-
-
Hardware: Standard PC/Laptop with minimum 8GB RAM, 4-core CPU (GPU optional for Neural Network training)
-
Dataset: NSL-KDD (training and testing sets)
-
-
Data Collection
-
Historical network traffic from NSL-KDD dataset, containing labeled attack types: DoS, Probe, R2L, U2R, Normal.
-
Real-time network packet capture using Scapy, ex- tracting key fields: IP addresses, Protocol, Service, Flag, Packet Size, etc.
-
-
Preprocessing
-
Feature Encoding: Convert categorical features (protocol type, service, flag) into numeric values.
-
Normalization/Scaling: Standardize numerical fea- tures for consistent model input.
-
Handling Missing/Anomalous Values: Remove or replace as required.
-
Train-Test Split: 70% training, 30% testing on NSL-KDD dataset.
-
-
Model Training
-
Random Forest (Baseline):
-
nestimators = 100
-
maxdepth = None
-
criterion = gini
-
-
Support Vector Machine (Optional Compari- son):
-
Kernel = RBF
-
C = 1.0
-
-
Neural Network (Advanced):
-
Input layer = 41 features
-
Hidden layers = 2 layers with 64 and 32 neurons
-
Activation = ReLU
-
Output layer = Softmax (multi-class classifica- tion)
-
-
Evaluation Metrics: Accuracy, Precision, Recall, F1- Score
-
-
Real-Time Detection
-
Capture network packets live using Scapy.
-
Extract features in the same format as training data.
-
Pass features to trained models for prediction.
-
Store results and trigger alerts if malicious activity is detected.
-
-
Frontend Visualization
Flask Dashboard displays:
-
Captured packets
-
Predicted class (Normal / DoS / Probe / R2L / U2R)
-
Alert notifications
-
Logs for analysis
-
Interactive tables and charts for easy monitoring
-
-
Implementation Workflow
-
Load pre-trained models and preprocessing pipeline (scaler).
-
Start Scapy packet capture on network interface.
-
Preprocess real-time packet data to match training format.
-
Predict attack type using trained models (Random Forest baseline ± SVM/NN).
-
Display results on Flask frontend.
-
Log data for performance evaluation.
-
-
RESULTS AND DISCUSSION
This section presents the performance of the ma- chine learning models trained on the NSL-KDD dataset along with the outcomes of real-time traffic detection using Scapy. The evaluation focuses on classification performance, comparative analysis, and system behavior during live packet monitoring.
-
Model Evaluation Results
The three classifiersRandom Forest, Support Vector Machine, and Neural Networkwere evaluated using standard metrics such as accuracy, precision, recall, and F1-score.
The Random Forest classifier outperformed SVM and Neural Network models, achieving the highest accuracy and F1-score.
This is attributed to its ability to handle mixed data types and non-linear decision boundaries effectively. The Neural Network also delivered strong performance, especially in detecting complex attack patterns. However, it required more training time and computational resources.
TABLE II
Performance Evaluation of ML Models on NSL-KDD Dataset
Model
Accuracy
Precision
Recall
F1-Score
Random Forest/p>
98.12%
97.85%
97.40%
97.62%
SVM (RBF Ker-
nel)
94.73%
93.40%
92.85%
93.12%
Neural Network
96.25%
95.80%
95.10%
95.35%
-
Attack-wise Classification Performance
To understand how well the models detect specific attack types, the dataset was evaluated across five cate- gories: Normal, DoS, Probe, R2L, and U2R.
-
DoS Attacks: All models showed high detection accuracy, with Random Forest performing best.
-
Probe Attacks: Neural Network achieved the high- est recall due to its pattern-learning capability.
-
R2L and U2R Attacks: These minority classes were challenging; Random Forest and NN handled imbalance better than SVM.
-
-
Confusion Matrix Analysis
Misclassifications mainly occurred between:
-
Probe vs. Normal traffic (due to similar feature distribution)
-
R2L vs. U2R (due to limited training samples) Oversampling techniques or advanced feature engi-
neering can reduce this gap.
-
-
Real-Time Detection Results
The real-time detection module using Scapy success- fully captured live packets and classified them using the trained Random Forest model.
Key observations:
-
Average packet processing time: 1218 ms
-
Real-time predictions displayed instantly on Flask dashboard
-
Alerts were triggered correctly for suspicious pat- terns
-
Low CPU usage for Random Forest, making it suitable for deployment
SVM and Neural Network models were comparatively slower for real-time use, reaffirming Random Forest as the preferred model.
-
-
Discussion of Findings
-
Random Forest proved to be the most reliable model, combining high accuracy with fast inference
time.
-
Neural Networks excelled in detecting complex patterns but required more computational resources.
-
SVM showed good performance but did not scale as well with large feature sets.
-
Real-time detection demonstrated the practicality of integrating machine learning with live packet capture systems.
-
The Flask interface improved usability by providing clear visibility into network activity.
-
Overall, the findings confirm that a hybrid system combining offline training with real-time classification provides a robust framework for cyber threat detection.
-
-
CONCLUSION AND FUTURE WORK
-
Conclusion
This research presented an AI-based cyber threat detection system that integrates historical NSL-KDD dataset analysis with real-time packet capture using Scapy and machine learning models. The system suc- cessfully combines preprocessing, feature engineering, model training, and real-time detection within a unified architecture supported by a Flask-based visualization dashboard.
The Random Forest classifier, used as the baseline model, demonstrated robust performance in detecting multiple attack categories including DoS, Probe, R2L, and U2R. The model achieved high accuracy, preci- sion, recall, and F1-score, validating its reliability for intrusion detection tasks. Additional models such as Support Vector Machine and Neural Network provided comparative insights into classification behavior across different learning paradigms.
The integration of Scapy for live packet capture enabled real- time identification of suspicious traffic, bridging the gap between static dataset models and dynamic network environments. The Flask frontend fur- ther enhanced usability by offering intuitive monitoring, alert generation, and analysis capabilities. Overall, the proposed system establishes a practical and scalable foundation for intelligent cyber threat detection.
-
Future Work
Although the proposed system performs effectively, several enhancements can be incorporated to further strengthen detection accuracy, scalability, and opera- tional efficiency. Future improvements include:
-
Advanced AI Models: Integrating deep learn- ing architectures such as LSTM, GRU, or hybrid CNNLSTM models for improved temporal pattern recognition in network traffic.
-
Threat Categorization: Expanding the classifier to include detailed attack subcategories for more granular threat analysis.
-
Database Integration: Adding a backend database (MySQL or PostgreSQL) to store alerts, packet logs, user
activity, and long-term analytics.
-
Real-Time Streaming: Incorporating frameworks like Apache Kafka for high-volume streaming and near- instantaneous detection on enterprise-scale networks.
-
Integration with IDS Tools: Combining the system with Snort or Suricata to enrich signature-based and anomaly-based detection.
-
Cloud Deployment: Deploying the solution on AWS, Azure, or GCP for distributed monitoring and high availability.
-
User Authentication: Adding secure login, role- based access control, and audit logs to support organizational use.
-
Mobile Alerts: Extending the system with SMS/email notifications or a dedicated mobile ap- plication for real-time alerts.
-
These enhancements will further extend the sys- tems applicability and make it suitable for large-scale, production- ready cybersecurity environments.
REFERENCES
-
S. Kumar and A. Singh, Cybersecurity challenges in modern networks, International Journal of Computer Applications, vol. 182, no. 25, pp. 1016, 2020.
-
M. Reddy and K. Sharma, Impact of increasing cyber-attacks on global IT infrastructure, Journal of Information Security, vol. 12, no. 3, pp. 4554, 2019.
-
J. Williams, Analysis of DoS and Probe attacks in enterprise networks, IEEE Transactions on Network Security, vol. 18, no. 4, pp. 233240, 2018.
-
P. Thomas and L. George, Evaluating the rise of R2L and U2R attacks in heterogeneous systems, IEEE Access, vol. 7,
pp. 155600155610, 2019.
-
T. Anderson, Limitations of traditional intrusion detection sys- tems, ACM Computing Surveys, vol. 51, no. 4, pp. 128, 2018.
-
F. Ali and S. Khan, Machine learning approaches for intrusion detection: A survey, IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 28212846, 2019.
-
Y. Zhao et al., AI-driven cyber threat analytics: Trends and challenges, Expert Systems with Applications, vol. 146, 113199, 2020.
-
M. Tavallaee et al., A detailed analysis of the KDD Cup 99 dataset, in Proc. IEEE Symposium on Computational Intelli- gence for Security and Defense Applications, 2009.
-
H. Patel and S. Mehta, Performance evaluation of Random Forest for intrusion detection, Procedia Computer Science, vol. 167, pp. 123 131, 2020.
-
A. O. Silva and J. Costa, Real-time packet capturing and analysis using Scapy, Interational Journal of Network Management, vol. 28, no. 6, e2050, 2018.
-
Smith et al., Random Forest-based intrusion detection using NSL-
KDD, Journal of Cybersecurity, 2018.
-
Lee and Park, SVM with feature scaling for intrusion detection,
Information Systems Research, 2017.
-
Kumar et al., LSTM models for network traffic analysis, IEEE Access, 2019.
-
Zhang and Wu, CNN architectures for intrusion detection,
Computers & Security, 2020.
-
Patel et al., Dimensionality reduction using PCA with Random Forest, Elsevier Future Generation Computer Systems, 2018.
-
Wang et al., Deep neural network for multi-attack detection,
Neurocomputing, 2021.
-
Chen et al., Hybrid SVM and Random Forest for improved
IDS, Pattern Recognition Letters, 2019.
-
Ali and Hussain, Autoencoders for anomaly detection in cyber-
security, IEEE Transactions on Information Forensics, 2020.
-
Gupta et al., Decision tree-based intrusion detection on NSL-
KDD, International Journal of Computer Networks, 2018.
-
Roy et al., KNN classifier performance in IDS, Procedia Computer Science, 2021.
-
Singh et al., Handling class imbalance using SMOTE in IDS,
Applied Soft Computing, 2019.
-
Sharma and Rao, Gradient boosting techniques for attack detec- tion,
IEEE Access, 2020.
-
Zhao et al., CNN-LSTM hybrid model for intrusion detection,
Expert Systems with Applications, 2021.
-
Khan et al., Real-time packet capture with Scapy and Random Forest,
Journal of Network Security, 2019.
-
Li et al., Feature selection with deep neural networks,
Knowledge-Based Systems, 2020.
-
Ahmed et al., SVM + PCA for efficient intrusion detection,
Pattern Analysis and Applications, 2021.
-
Tan et al., Random Forest ensemble for multiclass IDS, IEEE Transactions on Dependable Computing, 2018.
-
Kumar et al., Attention-based LSTM for real-time anomaly
detection, Information Sciences, 2020.
-
Wang et al., Hybrid CNN-RF for enhanced attack detection,
Computers & Security, 2019.
-
Patel et al., Autoencoder + Random Forest for unsupervised intrusion
detection, Applied Intelligence, 2021.
