- Open Access
- Authors : Mohammed Faizan, Pavithra B
- Paper ID : NCRTCA-PID-082
- Volume & Issue : NCRTCA – 2023 (VOLUME 11 – ISSUE 06)
- Published (First Online): 08-12-2023
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Unveiling The Digital Pandora’s Box-exploring Data Breaches Unraveled And Strategies For Prevention Using Machine Learning
MOHAMMED FAIZAN 
Post Graduate Student,
Dept. of Master of Computer Application D.S.C.E., Bangalore, India firstname.lastname@example.org
PAVITHRA B 
Dept. of Master Computer Application,
Abstract- A data breach refers to an unintended or unauthorized access or disclosure or loss of information which is confidential. Data breach occurs when an organization or a business chain encounters a security issue while storing and processing the information that compromises personal, financial and other sensitive information 
Data Breach is a major concern in today's growing connected world. This abstract provides a summary that explores the causes, consequence and avoidance strategies to reduce the severity of data breaches This defines the purpose of data breach on which there is an involvement of unauthorized individuals or entities having some intentions of the data for various reasons.
This abstract presents a methodology for enhancing data breach detection and prevention through the application of various algorithms and techniques. The focus is on the use of anomaly detection, machine learning algorithms, decision trees, clustering algorithms, and rule-based algorithms to identify and mitigate data breaches. The K means clustering algorithm and Isolation Forest algorithm are highlighted as examples of clustering and machine learning approaches for data breach detection. The effectiveness of these algorithms relies on data preprocessing, feature selection, and continuous monitoring . The purpose of this application is to develop robust cybersecurity measures that effectively mitigate the risks of data breaches and protect sensitive information. The methodology can be implemented using appropriate platforms and datasets, such as user login activity data, IP addresses, geolocation, and timestamps. By leveraging these algorithms
and techniques, organizations can detect anomalies, unusual patterns, and potential breaches, enabling timely response and prevention measures 
The implementation of the proposed methodology utilizes Jupyter Notebook as the primary platform for conducting the experiments. By systematically varying the values of the IP address within the dataset, the objective is to detect anomalies and identify failed login attempts using both the K- means clustering algorithm and the Isolation Forest algorithm. This approach allows for a comprehensive analysis of the dataset, enabling the identification of suspicious activities and potential data breaches.
Anomaly Detection, Clustering Algorithms, K-means Algorithm, Isolation Forest Algorithm, Advanced threat detection, Cloud computing, Incident response and recovery.
As the technology is more advanced and the data generated is vast. Cyber Criminals, hackers and other malicious individuals are constantly looking for loopholes as weakness in system and network to access personal and confidential data.
There are several ways that data breaches can happen, including hacking, malware or ransomware assaults, actual physical theft or loss of devices, insider risks, credential stuffing, and unintended data exposure  Understanding these different types of breaches helps in recognizing vulnerabilities and implementing
appropriate preventive measures. By examining real- life incidents, such as the recent data breach experienced by Domino's Pizza, it becomes evident that data breaches can affect organizations across industries and emphasize the need for robust cybersecurity measures.
Data breaches have become a widespread and worrying issue for firms across all industries in today's linked world. The unauthorized access to sensitive data can lead to severe consequences, including financial loss, reputational damage, customer relationship disruption, and operational disruptions. To combat this growing threat, organizations need to employ robust cybersecurity measures that not only prevent data breaches but also detect them promptly to minimize the impact 
This article delves into different algorithms and techniques that can aid in the detection and prevention of data breaches. It explores anomaly detection, machine learning algorithms, decision trees, clustering algorithms, and rule-based algorithms, highlighting their applications and effectiveness in identifying suspicious patterns and activities. Furthermore, it provides a detailed explanation of the K means clustering algorithm and Isolation Forest algorithm, showcasing how clustering and machine learning can be leveraged for data breach detection 
The article emphasizes the importance of data preprocessing and feature selection to ensure accurate and reliable results from these algorithms. It also underscores the need for continuous monitoring, evaluation, and refinement of cybersecurity measures to stay ahead of evolving threats.
Fig. 1 A Pie Chart showing the distribution of various Data Breaches occurring (Source: Zero Fox)
In the paper at hand, a meticulous survey has been conducted, exploring the landscape of machine learning techniques within the domain of cyber security over the past decade. Within its pages, one will discover an extensive compilation of discussions, statistics, and author profiles, offering a wealth of resources for further investigation. The paper provides a clear outline of the diverse sections and topics covered within the survey, allowing readers to navigate through its contents with ease. Emphasizing the significance of machine learning in the realm of cyber security, the paper delves into its evolutionary trajectory, tracing the remarkable advancements achieved in recent years. Notably, the survey encompasses a wide array of machine learning techniques deployed in cyber security, encompassing both supervised and unsupervised learning approaches. Furthermore, it conscientiously examines the limitations and challenges that arise when employing machine learning in the context of cyber security. As a conclusive note, the paper engages in a thought-provoking discourse on the future prospects of machine learning in the field, underscoring the necessity for continued research endeavors in this domain. To summarize, this survey stands as a comprehensive and all-encompassing exposition on the utilization of machine learning techniques in the realm of cyber security, offering a valuable resource for scholars, practitioners, and enthusiasts seeking to deepen their understanding of this critical intersection 
In the paper presented, a thorough examination is conducted to elucidate the pivotal role of machine learning within the domain of cybersecurity. Within its pages, the advantages and challenges associated with employing machine learning in this field are meticulously explored. The document elucidates how machine learning techniques possess the potential to surpass human-driven detection methods in terms of efficacy, enabling more efficient detection and prevention of cyber attacks. Furthermore, the significance of data quality and quantity in shaping the efficacy of machine learning models is emphasized. By delving into the intrinsic problems that can impede the real-world deployment of machine learning in cybersecurity, such as adversarial attacks and data poisoning, the paper sheds light on critical considerations for practitioners and researchers alike. Moreover, the document offers insightful recommendations for the future development of
machine learning in cybersecurity. It underscores the importance of fostering collaboration among stakeholders to harness the full potential of this technology, while also advocating for the integration of explainable AI methods to enhance transparency and interpretability. In essence, this paper presents a comprehensive overview of the subject matter, imparting a profound understanding of the transformative capabilities of machine learning within the cybersecurity industry 
In the paper presented, an in-depth exploration is undertaken to elucidate the intricacies of the K-Means clustering algorithm, which serves as a valuable tool within the domain of machine learning. This algorithm operates by partitioning a given dataset into a predetermined number of K clusters, each represented by a cluster centroid. The primary objective is to facilitate the grouping of data exhibiting similar characteristics into distinct clusters. The algorithm initializes by selecting K cluster centroids at random. It subsequently assigns data points to their nearest clusters and proceeds to iteratively update the optimal cluster centroids based on the corresponding data points. This iterative process continues until convergence is achieved, resulting in a final set of optimized cluster centroids. It is noteworthy, however, that the K-Means clustering algorithm does not possess the innate capability to precisely differentiate between normal and abnormal behavior. Consequently, researchers and practitioners often combine it with complementary algorithms such as the Naive Bayes Classifier to enhance the accuracy of anomaly detection. By leveraging the strengths of multiple techniques, a more comprehensive and robust framework can be constructed to tackle the challenges associated with distinguishing normal and abnormal behavior patterns. Thus, this paper provides an authoritative exploration of the K-Means clustering algorithm's mechanics, shedding light on its integration with other methodologies to achieve improved accuracy in anomaly detection tasks 
In the paper at hand, a comprehensive framework is presented, focusing on the detection of anomalous user behavior within information security systems. This framework introduces an extended isolation forest algorithm, renowned for its speed, scalability, and ability to operate without the need for exemplar anomalies within the training data set. The study selects a specific subset of users, specifically those
who possess a range of 501-600 access logs, totaling
495 users in entirety. The system undergoes ten iterations across all selected users, wherein each iteration involves the random partitioning of all records into two distinct sets. The training set is subsequently utilized to construct the Isolation Forest tree, serving as the foundational user model. During the testing period, an additional 100 records are randomly chosen from the entire dataset, excluding those associated with the respective user under examination. To assess the efficacy of the approach, key metrics including TP rate, FP rate, precision, recall, and accuracy are calculated. In order to provide a comprehensive analysis, the study conducts a comparative evaluation of seven detection systems, while also delving into the intricacies of five fundamental features. Remarkably, the proposed method exhibits broad applicability across diverse datasets and surpasses existing methodologies in terms of both accuracy and efficiency. In essence, this paper unveils a robust and efficient framework, centered around an extended isolation forest algorithm, poised to revolutionize the detection of anomalous user behavior within information security systems 
The problem at hand is the increasing frequency and impact of data breaches in today's digital landscape. Organizations face significant challenges in detecting and preventing these breaches, leading to financial losses, reputational damage, and customer relationship disruptions. There is a pressing need to explore effective algorithms and techniques that can proactively identify and mitigate data breaches, thereby safeguarding sensitive information and preserving the trust of customers and stakeholders.
Moreover, organizations must also adapt to emerging cybersecurity threats, comply with data protection and privacy regulations, establish robust incident response and recovery plans, and leverage emerging technologies to enhance data security. Addressing these challenges and implementing comprehensive cybersecurity strategies is crucial to reducing the likelihood and impact of data breaches.
Overall, the problem statement revolves around the need to develop and implement effective data breach detection and prevention measures, while also considering the evolving cybersecurity landscape and regulatory requirements.
To enhance data breach detection and prevention, a methodology is proposed that utilizes various algorithms and techniques. The primary focus is on the application of anomaly detection, machine learning algorithms, decision trees, clustering algorithms, and rule-based algorithms. The methodology aims to identify and mitigate data breaches effectively.
The first step in the proposed methodology is data preprocessing. This involves preparing the data for analysis by converting IP addresses into a numerical representation suitable for clustering algorithms. The 32-bit binary representation is commonly used for this purpose.
Next, the K-means clustering algorithm is applied. This algorithm requires the initialization of K cluster centroids, which are randomly selected from the dataset. The IP addresses are then assigned to their closest cluster based on distance calculations using the Hamming distance formula. The algorithm iteratively updates the cluster centroids until convergence is achieved. The result is a set of clusters that group similar IP addresses together 
Following the K-means clustering, the Isolation Forest algorithm is employed. This unsupervised machine learning algorithm is specifically designed for anomaly detection. It constructs a binary tree structure to isolate anomalous instances in the dataset. An anomaly score is assigned to each instance, indicating its degree of deviation from normal behavior. Instances with high anomaly scores are potential data breaches.
To determine the threshold for classifying instances as normal or anomalous, the distribution of anomaly scores is analyzed. Instances above the threshold are considered potential data breaches and require further investigation.
The proposed methodology emphasizes continuous monitoring and evaluation. It is important to assess the performance of the algorithms and adjust the threshold if needed. Fine-tuning the models by experimenting with different hyperparameters or incorporating additional features can further improve their accuracy in detecting data breaches.
To implement the proposed methodology, Jupyter Notebook is recommended as the primary platform for conducting experiments. By systematically varying the values of IP addresses within the dataset, anomalies and failed login attempts can be detected using both the K-means clustering algorithm and the Isolation Forest algorithm. This comprehensive analysis enables the identification of suspicious activities and potential data breaches.
In terms of datasets, user login activity data, including IP addresses, geolocation, timestamps, and other relevant features, can be utilized. These datasets provide valuable information for the algorithms to identify anomalies and potential data breaches.
By leveraging the proposed methodology, organizations can enhance their cybersecurity measures and effectively detect and prevent data breaches. The combination of clustering algorithms and anomaly detection techniques enables the identification of abnormal patterns and behaviors, allowing for timely response and mitigation of risks.
In conclusion, the proposed methodology provides a systematic approach to enhance data breach detection and prevention. By applying clustering algorithms and anomaly detection techniques, organizations can strengthen their cybersecurity measures and protect sensitive information from unauthorized access. Continuous monitoring, evaluation, and refinement of the methodology are crucial to stay ahead of evolving threats and maintain a robust data security posture.
Clustering algorithms can be used for data breach detection by grouping similar data points together and identifying outliers or anomalies. One commonly used clustering algorithm is the K-means algorithm. Here's a brief explanation of the K-means algorithm, along with a diagram and formula, illustrating its working principle.
K-means Clustering Algorithm
To apply the K-means Clustering Algorithm to a dataset of IP addresses, we need to preprocess the data appropriately. Here's a step-by-step derivation of the K-means algorithm for IP addresses:
Step 1: Preprocessing the IP addresses:
Convert each IP address to a numerical representation that can be used for clustering.
One common approach is to convert each IP address into a 32-bit binary representation.
Step 2: Initialization:
Select the number of clusters, K.
Initialize K cluster centroids at random, each of which is represented by a 32-bit binary number. Step 3: Assigning IP addresses to clusters:
Calculate the distance to each centroid for each IP address using a distance measure, such as Hamming distance.
Assign the IP address to the cluster whose centroid is closest.
Step 4: Distance formula (Hamming distance):
The Hamming distance refers to the count of positions where differing bits exist between two binary strings of the same length.
Step 5: Updating cluster centroids:
To determine the mean numerical value of each cluster's IP addresses, update the centroids of the clusters.
Step 6: Repeat step 3 and 4
Repeat steps 3 and 4 as many times as necessary until the centroids stop changing noticeably or the number of iterations reaches a maximum.
Step 7: Final Result
The algorithm converges when the centroids stabilize, and each of the IP address is assigned to its final cluster.
Fig. 2 Flowchart for K-Means Clustering Algorithm (Source: Research Gate)
The formula for calculating Euclidean distance between a data point (xi) and the centroid (ci) in K means algorithm is:
Euclidean Distance (ai, ri) = sqrt ((a1 – r1) ^2 + (a2 – r2) ^2 + … + (an – rn)^2)
where n is the number of features, and xi and ci are the feature values for the data point and centroid, respectively.
Isolation Forest Algorithm
Machine learning algorithms can be utilized to detect and prevent data breaches by analyzing patterns and identifying anomalies in data. One such algorithm commonly used for data breach detection is the Isolation Forest algorithm.
The Isolation Forest is an unsupervised ML algorithm that is based on the principle of isolating outliers in a dataset. It is particularly effective in detecting anomalies and identifying data points that do not conform to the normal behavior of the majority of the dataset 
Fig. 3 A Tree Structure to detect Anomaly (Source: Research Gate)
Isolation Forest algorithm to detect data breaches using this dataset:
Step 1: Data Preprocessing: In this example, the dataset is already in a suitable format, and no preprocessing is required.
Step 2: Feature Selection: For this scenario, we will use the "Login Time," "IP Address," and "Geolocation" features to detect potential data breaches.
Step 3: Training the Model of the Isolation Forest: Using the dataset with both successful and failed login attempts, we can train the algorithm. The algorithm
learns to isolate anomalous instances by constructing a binary tree structure.
Step 4: Detecting the Anomalies: Apply the trained Isolation Forest model to the dataset to identify potential anomalies. The program assigns an anomaly score to each instance, reflecting the degree of divergence from usual behavior.
Step 5: Threshold Determination: Analyze the distribution of anomaly scores and set a suitable threshold for classifying instances as normal or anomalous. Instances with anomaly scores above the threshold are considered potential data breaches.
RESULTS AND DISCUSSION
In the context of data breach detection and prevention, both the K-means clustering algorithm and the Isolation Forest algorithm have shown effectiveness in identifying anomalies and potential data breaches. Let's delve deeper into their applications and discuss their specific results.
K-means Clustering Algorithm: The K-means algorithm is a widely used clustering algorithm that can be applied to detect data breaches by grouping similar data points together and identifying outliers or anomalies. In the case of data related to user login activities, the K-means algorithm can help identify abnormal login behaviors that may indicate potential breaches.
By applying the K-means algorithm to a dataset of IP addresses, the algorithm goes through several steps to cluster the data effectively. These steps involve preprocessing the IP addresses, initializing cluster centroids, assigning IP addresses to clusters based on distance calculations, updating cluster centroids, and repeating these steps until convergence is achieved.
The results obtained from the K-means algorithm can help identify clusters of IP addresses that exhibit similar login patterns. Deviations from the normal behavior can be indicative of potential data breaches. By analyzing the characteristics of these anomalous clusters, organizations can take appropriate security measures to prevent unauthorized access and mitigate the risks of data breaches.
To demonstrate how clustering can help prevent data breaches, let's consider a dataset related to user login activities. This dataset contains information about
users' login behavior, including their IP addresses, login timestamps and the failed login attempts.
Fig. 4 A dataset to detect the number of failed login attempts
The IP address example provided above is a hypothetical scenario to illustrate the application of the K-means algorithm in detecting abnormal login behavior and potential data breaches.
By applying clustering algorithms on this dataset, we can identify patterns and group similar login activities together. This can help identify abnormal login behavior, such as multiple login attempts which have been failed from the same IP address, login attempts from unusual locations, or login patterns that deviate significantly from the norm. By detecting these anomalies, appropriate security measures can be implemented to prevent potential data breaches and unauthorized access to user accounts.
For the purpose of detecting anomalies, the Isolation Forest algorithm is an unsupervised machine learning algorithm. It works on the principle of isolating outliers or anomalies in a dataset. When applied to data breach detection, the Isolation Forest algorithm can effectively identify data points that deviate from the normal behavior of the majority of the dataset.
In the case of user login activities, the Isolation Forest algorithm can be trained on a dataset containing both successful and failed login attempts. The algorithm learns to isolate anomalous instances by constructing a binary tree structure. It assigns an anomaly score to each instance, indicating its degree of deviation from normal behavior.
The results obtained from the Isolation Forest algorithm can help identify instances with high anomaly scores, which suggest potential data breaches. Instances with failed login attempts from unusual geolocations or IP addresses are likely to
receive higher anomaly scores. By setting a suitable threshold, organizations can classify instances as normal or anomalous, enabling them to take appropriate actions to prevent data breaches.
Before Isolation Forest Algorithm
Fig. 5 A dataset that shows the login attempts before detecting anomaly
On applying Isolation Forest Algorithm
Fig. 6 Finding the anomaly score based on earlier data
Datasets for anomaly detection can vary depending on the application, such as network traffic data, sensor readings, financial transactions, or any other domain- specific data where anomalies need to be detected. These datasets are typically collected or curated from relevant sources by researchers, organizations, or data providers. The one used above is a hypothetical
scenario for the application of the Isolation Forest algorithm to detect the anomalies and get the score.
Evaluation and Iteration
Evaluate the performance of the Isolation Forest algorithm using appropriate evaluation metrics and adjust the threshold if needed. You can fine-tune the model by experimenting with different hyperparameters or by incorporating additional features to improve its accuracy in detecting data breaches.
In this example, the Isolation Forest algorithm assigns higher anomaly scores to instances with failed login attempts from unusual geolocations or IP addresses. These instances, with scores above the determined threshold, indicate potential data breaches and should be investigated further to confirm if any malicious activity or unauthorized access has occurred.
Remember, this example is for illustrative purposes only, and the actual implementation and performance of the Isolation Forest algorithm for data breach detection may vary depending on the specific dataset and context.
Comparison and Evaluation
Both the K-means clustering algorithm and the Isolation Forest algorithm have their strengths and limitations in the context of data breach detection.
The K-means algorithm excels in grouping similar data points together and identifying clusters that exhibit abnormal behaviors. It provides insights into patterns and similarities within the dataset. However, the effectiveness of the K-means algorithm heavily relies on appropriate preprocessing of the data and careful selection of features. It may also be sensitive to outliers and noise in the dataset.
On the other hand, the Isolation Forest algorithm is specifically designed for anomaly detection and can effectively identify instances that deviate from normal behavior. It is robust to outliers and can handle high- dimensional datasets. However, the performance of the Isolation Forest algorithm depends on the quality and representativeness of the training dataset, as well as the selection of appropriate features.
Real-life incident on Data Breach
The data breach incident at Domino's Pizza serves as a significant example highlighting the importance of robust cybersecurity measures in the food industry. In this incident, unauthorized individuals gained access to customer data, including names, delivery addresses, email addresses, and contact information. However, it is worth noting that sensitive payment card details were not compromised in this particular data breach.
The breach at Domino's Pizza highlights the importance of organizations prioritizing data security and taking proactive measures to secure client information. Such incidents can have serious implications, including financial losses, reputational harm, and a loss of customer trust.. By learning from this incident, organizations can implement stronger cybersecurity measures to prevent similar breaches and maintain the security of customer data.
In response to the breach, Domino's Pizza likely initiated an investigation to determine the extent of unauthorized access and the potential impact on customers. They would have also taken steps to secure their systems and address any vulnerabilities that may have allowed the breach to occur. Additionally, they may have engaged with law enforcement agencies and regulatory bodies to report the incident and comply with relevant data breach notification requirements.
Furthermore, firms should examine and update their cybersecurity procedures on a regular basis to react to emerging threats and remain ahead of potential vulnerabilities. By implementing these procedures in place, organizations may improve their cybersecurity posture, lower the chance of data breaches, and keep customers' trust and confidence.
It is important to note that the specific actions taken by Domino's Pizza in response to the data breach incident may not be publicly disclosed or available. Therefore, the information provided here is based on general practices and recommendations for organizations in similar situations.
Fig. 7 A graph that shows the top data breaches of all time (Source: Hive Systems)
Consequences of Data Breach
Financial loss: The organizations may face direct cost such as legal fee, regulatory fines and potential lawsuits.
Reputational effects: Data Breach can severely lead to the downfall of the reputation and trust of the organization. Reputational news may lead to negative publicity and public scrutiny.
Damages to Customer Relationship: Data Breaches break customer's trust and confidence, Customers will become hesitant to share their personal data which may lead to reduced interaction and a negative impact.
Identity Theft and Fraud: Personal info which may have breached can be used for identity proxy and fraud. Cyber Criminals or unauthorized persons may use fraudulent accounts of commit financial crimes from the data gained.
Operation Disruption: Data Breach may require patience and time to be solved and get back to normal.
Organizations which have been a victim to data breach would halt their operations temporarily to investigate, impacting productivity and service delivery.
The future enhancement of this research involves continuous monitoring and improvement of cybersecurity measures to stay ahead of evolving threats. Some potential areas for future exploration include:
Advance Threat Detection: To identify and react to new cyber threats in real-time, sophisticated threat detection systems are being developed employing machine learning and artificial intelligence.
Secure Cloud Computing: Investigating secure cloud computing solutions that ensure the confidentiality, integrity, and availability of data stored into the cloud, considering the increasing reliance on cloud-based services.
Regulatory Compliance: Addressing the evolving regulatory landscape and ensuring compliance with data protection and privacy regulations to avoid legal implications and potential fines.
Incident Response and Recovery: Establishing robust incident response and recovery plans to minimize the impact of data breaches and facilitate the timely restoration of normal operations.
Emerging Technologies: Exploring the potential of emerging technologies such as blockchain for enhancing data security and privacy in various industries.
By focusing on these areas, organizations can strengthen their cybersecurity posture and proactively adapt to emerging threats, thereby reducing the likelihood and impact of data breaches.
Data breaches continue to pose significant threats to individuals and organizations in our interconnected world. It is crucial to understand the causes, consequences, and avoidance strategies associated with data breaches. Strong cybersecurity safeguards and proactive measures are essential to address this growing concern.
To enhance data breach detection and prevention, both the K-means clustering algorithm and the Isolation Forest algorithm offer valuable approaches. These algorithms enable organizations to detect anomalies, unusual patterns, and potential breaches, facilitating
timely response and preventive measures. However, it is important to note that relying solely on these algorithms is insufficient. They should be part of a comprehensive cybersecurity strategy that includes network security controls, access management, encryption, and user awareness and training.
Continuous monitoring, evaluation, and refinement of cybersecurity measures are crucial to stay ahead of evolving threats. Organizations should explore advanced threat detection techniques, secure cloud computing solutions, regulatory compliance, incident response and recovery plans, and emerging technologies. By leveraging these measures, organizations can bolster their data breach prevention and detection strategies.
Clustering algorithms, such as the K-means algorithm, group similar data points and identify outliers that may indicate data breaches. The Isolation Forest algorithm is effective in detecting anomalies in datasets. Applying these algorithms to user login activities or sensitive information can help detect abnormal behaviors and prevent potential breaches.
The consequences of data breaches are severe, including financial loss, reputational damage, customer relationship disruption, and operational disruptions. To mitigate these impacts, organizations must prioritize data protection, comply with regulations, and establish robust incident response and recovery plans. Real-life incidents, like the one at Domino's Pizza, serve as reminders of the importance of implementing strong cybersecurity measures.
Looking to the future, continuous monitoring, advanced threat detection, secure cloud computing, regulatory compliance, incident response and recovery, and exploration of emerging technologies will be critical for enhancing data breach prevention and detection strategies.
2019. A. Anderson and R. Fouche, "Cybersecurity incidents and data breaches: An analysis of root causes," Computers & Security, vol. 77, pp. 184-196, 2018.  L. Stevens and T. Davis, "Data breaches: Causes, costs, and preventive strategies," International Journal of Information Management, vol. 37, no. 6, pp. 564-
573, 2017. Q. Chen, D. Preston, and P. Swatman, "Data breaches and their impact on consumer trust: Insights from Australia," Australasian Journal of Information Systems, vol. 20, pp. 1-17, 2016.  J. Yang, W. Yu, and J. Xu, "Data breach detection using clustering algorithms," IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 1, pp. 4-17, 2018.  R. Mishra and S. Garg, "Detecting data breaches through clustering analysis," Journal of Computer Security, vol. 26, no. 2, pp. 135-154, 2018.  Dan Swinhoe, "The 15 biggest data breaches of the 21st century", CSO Online (September 2021)  H. Kang, S. Lee, and C. Lee, "Data breach detection system based on machine learning techniques," in 2016 International Conference on Platform Technology and Service (PLATCON, 2016)  A. MartÃnez-Mendoza, J. F. Villa-Miranda, and
M. I. SÃ¡nchez-Ortiz, "An overview of machine learning techniques for data breach detection," in 2019 4th International Conference on Information Systems and Computer Science (INCISCOS,2019) A. K. Ghosh and A. Schwartzbard, "Learning intrusion detection: Supervised or unsupervised?" in Proceedings of the 1999 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.  2016 cost of data breach study: global analysis. 2017 Available at: https://www- 03.ibm.com/security/data-breach  Alneyadi S, Sithirasenan E, Muthukkumarasamy V- A survey on data leakage prevention systems. J Netw Comput Appl 2016, 62(C):137152  Kamra A, Terzi E, Bertino E. Detecting anomalous access patterns in relational databases.VLDB J 2008,17:10631077  Phyo AH, Furnell SM. A detection-oriented classication of insider IT misuse. Computers & Security 2002:21:6273  D. K. Bhattacharyya and J. K. Kalita, Network Anomaly Detection:A Machine Learning Perspective.
London, U.K.: Chapman & Hall, 2013 V. Ambalavanan, Cyber threats detection and mitigation using machine learning, in Handbook of Research on Machine and Deep Learning Applications for Cyber Security. Hershey, PA, USA: IGI Global, 2020, pp. 132149  T. Thomas, A. P. Vijayaraghavan, and S. Emmanuel, Machine learning and cybersecurity, in Machine Learning Approaches in Cyber Security Analytics. Singapore: Springer, 2020, pp. 3747  I. Firdausi, C. Lim, A. Erwin, and A. S. Nugroho, Analysis of machine learning techniques used in behavior-based malware detection, in Proc. 2nd Int. Conf. Adv. Comput., Control, Telecommun. Technol., Dec. 2010,
pp. 201203 C. Virmani, T. Choudhary, A. Pillai, and M. Rani, Applications of machine learning in cyber security, in Handbook of Research on and Deep Learning Applications for Cyber Security. Hershey, PA, USA:
IGI Global, 2020, pp. 83103 S. Saad, W. Briguglio, and H. Elmiligi, The curious case of Machine Learning in malware detection, 2019  K. Geis, Machine learning: Cybersecurity that can meet the demands of today as well as the demands of tomorrow, Ph.D. dissertation, Master Sci. Cybersecur., Utica College, Utica, NY, USA, 2019  J. M. Torres, C. I. ComesaÃ±a, and P. J. GarcÃa- Nieto, Machine learning techniques applied to cybersecurity, Int. J. Mach. Learn. Cybern.,vol. 10, no. 10, pp. 28232836, 2019  M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, Network anomaly detection: Methods, systems and tools, IEEE Commun. Surveys Tuts., vol. 16, no. 1, pp. 303336, 1st Quart., 2014  P. Mishra, V. Varadharajan, U. Tupakula,and E.
S. Pilli, A detailed investigation and analysis of using machine learning techniques for intrusion detection, IEEE Commun. Surveys Tuts., vol. 21, no. 1, pp. 686728, 1st Quart., 2019 R. Chalapathy and S. Chawla, Deep learning for anomaly detection: A survey, 2019  A. M. Chandrasekhar and K. Raghuveer, Intrusion detection technique by using k-means, fuzzy neural network and SVM classiers, in Proc.Int. Conf. Comput. Commun. Informat., Jan. 2013, pp. 17