Deep Learning to Detect Software Vulnerabilities

Gourav Bansal

doi:10.17577/IJERTV11IS020148

Volume 11, Issue 02 (February 2022)

Open Access
Authors : Gourav Bansal
Paper ID : IJERTV11IS020148
Volume & Issue : Volume 11, Issue 02 (February 2022)
Published (First Online): 01-03-2022
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Kurukshetra University

Abstract- The importance of automated vulnerability analysis is growing as more software is developed. In this research, we present a deep learning-based method that learns assembly code in order to detect software flaws. Unlike previous research that relied on API function call sequences, our method first stores the assembly code in a fixed-dimension vector and then uses deep learning to learn the assembly language. To model assembly code, we choose Instruction2vec, which vectorizes the code efficiently. After learning the assembly code of existing functions from the vectors produced by Instruction2vec, we classify whether new functions contain software weaknesses. Many deep learning approaches to vulnerability detection have been developed. Most learning-based approaches, however, discover vulnerabilities in source code rather than binary code. In this paper, we present our method for detecting vulnerabilities in binary code. Our method builds deep learning models that discover vulnerabilities using binary code compiled from the SARD dataset.

Keywords- Vulnerability, Binary Code, Vulnerability detection, security, SARD dataset, Deep Learning, symmetric cryptographic algorithms, API function calls.

INTRODUCTION

Detecting vulnerabilities in software systems before they are distributed to consumers is one of the most effective ways to prevent security issues. A wide range of approaches to detecting vulnerabilities has been proposed over the years, including fuzzing [6], symbolic execution [2], taint analysis, and machine learning [9]. Many real-world cyberattacks [1,7] have leveraged software vulnerabilities to compromise computer systems. For example, a vulnerability was to blame for the data breach that exposed the private information of 533 million Facebook users [1]. As a result, resolving vulnerabilities effectively and efficiently is crucial for cybersecurity.

Recent advances in machine learning, particularly deep learning techniques, have shown that learning-based approaches have the potential for reliable vulnerability detection [8]. Because program source code contains a wealth of information, such as data types, variable names, function prototypes, and high-level program constructs, the vast majority of these approaches focus on open-source projects and extract features from the source code for model training.
APPROACHES TO DETECTING VULNERABILITIES

First, we must collect a binary code dataset, determine the granularity of our vulnerability detection, and build the vulnerability detection process. Do we look for flaws in individual programs, functions, basic blocks, or program slices? How do we determine which code is vulnerable and which is not?

Second, deep learning systems require features to distinguish between vulnerable and non-vulnerable code. What characteristics should we look for in binary code?

Our method generates a binary code dataset by compiling C/C++ programs from the SARD dataset, which is extensively used as a testbed for discovering source code vulnerabilities. At the function level, the SARD dataset includes labels for vulnerable and non-vulnerable code. We chose to detect vulnerabilities at the function level so that we could use the labels that come with the dataset directly. Unlike previous work that analyzes program code as a collection of words or tokens, our method uses the semantic information contained in the assembly instructions of binary code as features to train machine learning and deep learning models. These features include instruction mnemonics, operand types, operand positions, and operand names.
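As an illustration of the kinds of features listed above, the following minimal sketch splits one assembly instruction into its mnemonic and per-operand position, type, and name. It assumes 32-bit x86 in Intel syntax and a small hand-written register set; the function names and the output layout are hypothetical and are not taken from the paper or from Instruction2vec.

```python
import re

# Assumption: a small subset of 32-bit x86 registers, for illustration only.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def operand_type(op):
    """Classify one operand as memory, register, immediate, or other."""
    if "[" in op:
        return "mem"
    if op.lower() in REGISTERS:
        return "reg"
    if re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):
        return "imm"
    return "other"


def instruction_features(line):
    """Return the mnemonic plus position, type, and name of each operand."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [o.strip() for o in rest.split(",") if o.strip()]
    return {
        "mnemonic": mnemonic.lower(),
        "operands": [
            {"position": i, "type": operand_type(o), "name": o.lower()}
            for i, o in enumerate(operands)
        ],
    }


features = instruction_features("mov eax, [ebp-0x8]")
```

A real pipeline would take its instructions from a disassembler and map these symbolic features to numeric vectors before training.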

We use grid search to train LSTM models with multiple hyperparameter values on the dataset, compare the performance of the models, and keep the hyperparameter values that yield the best results.
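The grid search described above can be sketched as follows. The hyperparameter names (`units`, `dropout`) and the scoring function are assumptions made for illustration; the paper does not list the values it searched. In practice `train_and_score` would train an LSTM and return a validation metric, whereas here a toy stand-in keeps the example self-contained.

```python
from itertools import product


def grid_search(train_and_score, grid):
    """Try every combination in `grid` and return the best-scoring one."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical placeholder for "train an LSTM, return validation accuracy".
    return -abs(params["units"] - 64) - 100 * abs(params["dropout"] - 0.2)


# Assumed search space; the values are illustrative only.
grid = {"units": [32, 64, 128], "dropout": [0.2, 0.5]}
best_params, best_score = grid_search(toy_score, grid)
```

The cost grows with the product of the list lengths, so the grid is usually kept small.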
NEURAL NETWORKS

As input to the deep learning models, the feature arrays were reshaped from two-dimensional to three-dimensional arrays.
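A minimal sketch of such a reshape is shown here, assuming the common recurrent-network layout of (samples, timesteps, features). The paper does not state what its three dimensions represent, so this layout and the helper name `to_3d` are assumptions.

```python
def to_3d(rows, timesteps):
    """Split each flat row into `timesteps` equal-width feature vectors."""
    cube = []
    for row in rows:
        if len(row) % timesteps:
            raise ValueError("row length must be divisible by timesteps")
        width = len(row) // timesteps
        cube.append([row[i * width:(i + 1) * width] for i in range(timesteps)])
    return cube


# Two samples with six values each become two samples of 3 timesteps x 2 features.
flat = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
cube = to_3d(flat, timesteps=3)
```

With an array library, the same result is usually obtained with a single row-major reshape call.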
The previous models applied a threshold of 0.5 to their outputs to predict the target class. For example, outputs less than or equal to 0.5 are assigned to class 0 (non-vulnerable functions). The models' performance shows that they make some Type I (false positive) and Type II (false negative) errors, so the threshold must be adjusted to minimize the cost of those errors.
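To make the two error types concrete, the sketch below applies a threshold to a handful of made-up scores and counts the outcomes. The scores and labels are invented for illustration and are not results from the paper; a false positive corresponds to a Type I error and a false negative to a Type II error.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count outcomes when scores <= threshold are assigned to class 0."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if predicted == 1:
            counts["tp" if label == 1 else "fp"] += 1  # fp is a Type I error
        else:
            counts["tn" if label == 0 else "fn"] += 1  # fn is a Type II error
    return counts


# Hypothetical model outputs and ground-truth labels (1 = vulnerable).
scores = [0.1, 0.5, 0.7, 0.9, 0.4]
labels = [0, 0, 0, 1, 1]
counts = confusion_counts(scores, labels)
```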

In some cases, such as when using precision-recall curves and ROC curves, the optimal threshold for the classifier can be calculated directly. In other situations, a grid search can be used to fine-tune the threshold and find the best value.
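A threshold grid search of this kind can be sketched as below. The cost weights, the number of candidate thresholds, and the sample data are assumptions for illustration; the paper does not specify which objective it optimized.

```python
def best_threshold(scores, labels, fp_cost=1.0, fn_cost=1.0, steps=101):
    """Scan evenly spaced thresholds in [0, 1] and keep the cheapest one."""
    best_t, best_cost = 0.5, float("inf")
    for i in range(steps):
        t = i / (steps - 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:  # strict comparison keeps the lowest such threshold
            best_t, best_cost = t, cost
    return best_t, best_cost


# Hypothetical validation scores and labels (1 = vulnerable).
scores = [0.2, 0.35, 0.4, 0.6, 0.8]
labels = [0, 0, 1, 1, 1]
threshold, cost = best_threshold(scores, labels)
```

Raising `fn_cost` above `fp_cost` pushes the chosen threshold lower, which suits settings where a missed vulnerability is more expensive than a false alarm. The threshold should be selected on validation data rather than on the test set.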

In all of the preceding models, the sigmoid activation function was applied in the last layer to produce the network outputs. The outputs are decimal numbers between 0 and 1, which indicate whether an input is more likely to be assigned to class 0 or class 1, depending on the threshold.
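For reference, the sigmoid function maps any real-valued input x to 1 / (1 + exp(-x)). The sketch below is a numerically stable scalar version written for illustration; deep learning frameworks provide their own implementations.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# A raw output of 0 sits exactly on the default 0.5 decision boundary.
midpoint = sigmoid(0.0)
```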
DEEP LEARNING

Wu et al. use sequences of C library function calls as the dataset to build neural network models that predict vulnerabilities [10]. They turn each sequence of C library function calls into a list of word tokens, convert the tokens into vectors, and then train three neural network models on the vectors: CNN, LSTM, and CNN-LSTM. Their analysis shows that the neural network models outperform the MLP used by VDiscover by a significant margin.
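One common way to turn call sequences into model input is to map each function name to an integer id and pad the sequences to a fixed length. The sketch below shows that idea only; the exact preprocessing used by Wu et al. is not described here, and the special tokens, helper names, and example traces are assumptions.

```python
def build_vocab(sequences):
    """Assign an integer id to every distinct call name, in order of appearance."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for call in seq:
            vocab.setdefault(call, len(vocab))
    return vocab


def encode(seq, vocab, max_len):
    """Map calls to ids, truncate to max_len, and right-pad with <pad>."""
    ids = [vocab.get(call, vocab["<unk>"]) for call in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Hypothetical traces of C library calls.
traces = [["malloc", "strcpy", "free"], ["fopen", "fread", "fclose", "free"]]
vocab = build_vocab(traces)
encoded = [encode(trace, vocab, max_len=4) for trace in traces]
```

The integer ids would then typically feed an embedding layer in front of the CNN or LSTM.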

VulDeePecker trains neural networks to discover vulnerabilities using code gadgets, which are groups of program statements that are data- or control-dependent on one another. Focusing on library and API function calls, it extracts the relevant program slices as code gadgets and turns them into vectors using word2vec. It then uses the vectors to train BLSTM models for vulnerability detection. VulDeePecker considerably lowers false positives compared with previous vulnerability detection systems, such as VulPecker [5].
MACHINE LEARNING

Yamaguchi et al. extract information about API function calls for every function of a target program from its source code, convert the extracted information into vectors, and apply principal component analysis (PCA) to the vectors to identify the dominant API usage pattern of each function. Their method predicts vulnerable functions by comparing the usage patterns of functions with that of a known vulnerable function. The authors demonstrate the usefulness of the approach with an example vulnerability.
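The comparison step can be illustrated with a simplified sketch. Note that it swaps in plain cosine similarity over raw API call counts and omits the PCA projection, so it is not Yamaguchi et al.'s method; the function names and call lists are invented for the example.

```python
import math
from collections import Counter


def usage_vector(calls, api_names):
    """Count how often each API name appears in one function's call list."""
    counts = Counter(calls)
    return [counts[name] for name in api_names]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(functions, known_vulnerable, api_names):
    """Rank the other functions by similarity to the known vulnerable one."""
    reference = usage_vector(functions[known_vulnerable], api_names)
    scored = [
        (name, cosine(usage_vector(calls, api_names), reference))
        for name, calls in functions.items()
        if name != known_vulnerable
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical functions and the API calls each one makes.
functions = {
    "parse_header": ["malloc", "memcpy", "strlen"],
    "copy_field": ["malloc", "memcpy", "strlen", "memcpy"],
    "log_event": ["fopen", "fprintf", "fclose"],
}
api_names = sorted({call for calls in functions.values() for call in calls})
ranking = rank_by_similarity(functions, "parse_header", api_names)
```

Functions at the top of the ranking share the most API usage with the known vulnerable function and would be reviewed first.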
CONCLUSION

In the field of cybersecurity, software vulnerability has long been an important and challenging research topic. Machine learning (ML)-based approaches have recently attracted increased interest in software vulnerability detection research. The detection performance of existing ML-based approaches, however, still needs to be improved, and two issues stand out. The first is code representation for machine learning; the second is the class imbalance between vulnerable and non-vulnerable code.

This study describes how we used deep learning to find vulnerabilities in binary code. We compile C/C++ programs from the SARD dataset into binary code and use the semantic information in the assembly instructions as features to train deep learning models. In our evaluation, the BLSTM model outperforms the BGRU model and detects vulnerabilities with an accuracy of 81 percent. The close agreement between the macro-average and weighted-average results indicates that the models perform well in classifying both vulnerable and non-vulnerable code.

REFERENCES

[1] 533 million Facebook users' phone numbers and personal data have been leaked online. 2021. https://www.businessinsider.com/stolen-data-of-533-million-facebook-users-leaked-online-2021-4
[2] Thanassis Avgerinos, Sang Kil Cha, Alexandre Rebert, Edward J Schwartz, Maverick Woo, and David Brumley. 2014. Automatic exploit generation. Commun. ACM 57, 2 (2014), 74-84.
[3] Gustavo Grieco, Guillermo Luis Grinblat, Lucas Uzal, Sanjay Rawat, Josselin Feist, and Laurent Mounier. 2016. Toward Large-Scale Vulnerability Discovery Using Machine Learning. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy (CODASPY '16). Association for Computing Machinery, New York, NY, USA, 85-96. https://doi.org/10.1145/2857705.2857720
[4] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. 2017. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey. ACM Computing Surveys (CSUR) 50, 4 (2017), 1-36. IEEE, August 2016. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/7605062/?casa_token=gYCfDKXeuQoAAAAA:M7ZVVdmCZIDah8vHkHPf4_WJKT6_qw_A0NYJHFbg-LZt1CbmMvMOJox-EV2Sm_ZDEChddIY6yuObfr8
[5] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Hanchao Qi, and Jie Hu. 2016. VulPecker: An Automated Vulnerability Detection System Based on Code Similarity Analysis. In Proceedings of the 32nd Annual Conference on Computer Security Applications (ACSAC '16). Association for Computing Machinery, New York, NY, USA, 201-213. https://doi.org/10.1145/2991079.2991102
[6] Tielei Wang, Tao Wei, Guofei Gu, and Wei Zou. 2010. TaintScope: A checksum-aware directed fuzzing tool for automatic software vulnerability detection. In 2010 IEEE Symposium on Security and Privacy. IEEE, 497-512.
[7] VMware Flaw a Vector in SolarWinds Breach? 2020. https://krebsonsecurity.com/2020/12/vmware-flaw-a-vector-in-solarwinds-breach/
[8] Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang. 2020. Software vulnerability detection using deep neural networks: A survey. Proc. IEEE 108, 10 (2020), 1825-1848.
