Credit Card Fraud Detection Using Machine Learning

DOI : 10.17577/IJERTCONV7IS10036


Swaroop K Dept. of ISE SDMCET

Dharwad, India

Amruta D Dept. of ISE SDMCET

Dharwad, India

Sanath J Dept. of ISE SDMCET

Dharwad, India

Pooja G Dept. of ISE SDMCET

Dharwad, India

Abstract: In today's world, the credit card is the easiest mode of payment for both online and offline purchases, and it enables cashless shopping across the globe. Fraud occurs mainly during online payment, where the credit card number alone is sufficient to make a transaction; for offline payment a password is requested, so fraud is much harder to commit in offline transactions. In existing systems, fraud is detected only after the transaction has been completed. Companies hold detailed transactional and fraud data, and fraud tends to appear in patterns. With billions of credit card transactions, it is difficult to analyse each one in isolation, so predictive algorithms can help to detect fraudulent transactions; this is where data mining comes into play. The data consists of a combination of continuous and nominal data, and a variety of statistical tests can be used to detect fraud events. Detecting credit card fraud is still not a perfect science. While fraud remains a major financial issue for banks, the distribution of transactions is severely skewed towards non-fraudulent ones: out of an estimated 12 billion transactions made annually, about 10 million are fraudulent (that is, one in every 1,200 transactions). To analyse and predict fraud events, we used the Local Outlier Factor and Isolation Forest algorithms to identify fraudulent transactions, and we calculated the accuracy and number of errors of both algorithms.

Keywords: Credit card, Isolation forest, Local outlier factor, Fraud detection, Data mining.


    In our daily routine we use credit cards to buy goods and services, either online or with the physical card for offline transactions. In a card-based purchase, the cardholder hands the card to the merchant to make the payment, so a fraudster has to steal the physical card to make a fraudulent transaction. If the user is not aware that the card is lost, this leads to financial loss for the user as well as the credit card company. When the payment is made online, attackers need only a little information, for example the card number, to make a fraudulent transaction. The only way to detect this kind of fraud is to analyse the spending patterns on every card and identify irregularities with respect to the normal pattern. Detecting fraud from the cardholder's existing purchase data is a way to reduce the fraud rate. Every cardholder is characterised by patterns containing information about the typical purchase category, the time since the last purchase, the amount of money spent, and other attributes. Deviation from such patterns is treated as fraud.

    Fraud in finance is an ever-growing issue with far-reaching consequences. Fraud can be defined as criminal deception with the aim of financial gain. The emergence of the internet has led to an increase in credit card transactions. Because the credit card is the most prevalent payment method and attracts more discounts and offers in both stores and e-commerce, it is more vulnerable to fraud. Credit card fraud detection is the science and art of detecting unusual activity in credit transactions. Fraud occurs when an individual's credit card information is stolen and used to make unauthorized purchases or withdrawals from the original holder's account. A major challenge for credit card fraud detection research is the limited availability of real-world data due to privacy and legal concerns. Online shopping is one of the largest and fastest-growing trends, with payment made by credit card, debit card, or net banking. Online payment does not require the physical card, so if the credit card details become known to others, this is a major risk. Currently, the cardholder comes to know of the fraud only after the fraudulent transaction has been carried out, and no mechanism exists to flag such transactions as they occur. This project aims to address that gap. Using a dataset of nearly 28,500 credit card transactions and multiple unsupervised anomaly detection algorithms, we identify transactions with a high probability of being credit card fraud. Furthermore, using metrics such as precision, recall, and F1-score, we investigate why classification accuracy can be misleading for these algorithms. In addition, we explore data visualization techniques common in data science, such as parameter histograms and correlation matrices, to gain a better understanding of the underlying distribution of our dataset.


    In [2] the authors begin by explaining how transactions are carried out with credit cards. They propose a system in which their algorithm is integrated with the payment gateway to detect fraud in real time. The authors examined seven techniques to develop the algorithm: Neural Networks, Rule Induction, Case-Based Reasoning, Genetic Algorithms, Inductive Logic Programming, Expert Systems, and Regression. They determined that the artificial neural network (ANN) method would best serve this problem. The output of the neural network is a probability that indicates the degree to which a transaction is fraudulent. The network is trained on information about the cardholder across various categories, such as profession, earnings, and information about large purchases. The system uses the back-propagation learning algorithm to train the network. Depending on the numeric value of the probability, between 0 and 1, a transaction is classified into one of the following categories: Non-Fraudulent, Doubtful, Suspicious, and Fraudulent. The system focuses on the merchant side of the industry, and benefits the merchant by reducing the losses the merchant has to bear when a transaction is fraudulent. It is therefore limited by the availability of merchant-side transaction data, which is hard to obtain at scale.
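The mapping from the network's output probability to the four categories described in [2] can be sketched as follows. This is only an illustration: the cut-off values are not given here, so the thresholds 0.25, 0.50 and 0.75, as well as the name `categorize`, are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical cut-offs: the actual thresholds are not stated in the text.
CUTOFFS = [0.25, 0.50, 0.75]
LABELS = ["Non-Fraudulent", "Doubtful", "Suspicious", "Fraudulent"]


def categorize(probability: float) -> str:
    """Map a fraud probability in [0, 1] to one of the four categories."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # bisect_right counts how many cut-offs are <= probability,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(CUTOFFS, probability)]


for p in (0.10, 0.40, 0.60, 0.90):
    print(p, categorize(p))
```

With real data the cut-offs would have to be tuned, for example to balance the cost of blocking a genuine transaction against the cost of missing a fraudulent one.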

    The authors of [3] focused on the Chinese market, which is growing rapidly. They proposed a data mining technique based on outlier detection using the distance sum to identify fraudulent transactions. They preferred this method over traditional statistical methods such as regression and discriminant analysis because outlier detection does not assume a particular distribution of the dataset. The paper used the Euclidean distance formula to calculate the distance sum for each object. The authors calculated a threshold value for the distance sum; if an object's distance sum is above the threshold, the object is classified as an anomaly, in this case a fraudulent transaction. The authors collected data from a domestic bank in China, with 16,000 observations, and achieved a highest accuracy of 89.4% for a threshold value of 12. The results of this method, however, depend heavily on the nature of the data, and the appropriate threshold may vary for data from different banks.
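A minimal sketch of distance-sum outlier detection is given below. It reflects our reading of the idea in [3], namely that each object's score is the sum of its Euclidean distances to all other objects, and is not the authors' implementation. The function name `distance_sum_outliers`, the toy points, and the threshold of 30 are assumptions chosen for illustration.

```python
import numpy as np


def distance_sum_outliers(points, threshold):
    """Return each point's summed Euclidean distance to all other points,
    and a boolean flag marking points whose sum exceeds the threshold."""
    X = np.asarray(points, dtype=float)
    # Pairwise differences via broadcasting: shape (n, n, d).
    diffs = X[:, None, :] - X[None, :, :]
    pairwise = np.sqrt((diffs ** 2).sum(axis=2))
    # The distance of a point to itself is 0, so it does not affect the sum.
    sums = pairwise.sum(axis=1)
    return sums, sums > threshold


# Four points close together and one far away (toy data).
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
sums, is_outlier = distance_sum_outliers(points, threshold=30.0)
print(sums.round(2))
print(is_outlier)
```

The pairwise matrix needs memory proportional to the square of the number of objects, so for a dataset of tens of thousands of transactions the distances would have to be computed in batches.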


    The fraud detection module will work in the following steps:

    1. The incoming set of transactions and amounts is treated as credit card transactions.

    2. The credit card transactions are given to the machine learning algorithms as input.

    3. The algorithms analyze the data, observe patterns, and perform anomaly detection using Local Outlier Factor and Isolation Forest; the output labels each transaction as either fraud or valid.

    4. Fraudulent transactions trigger an alarm that alerts the user that a fraudulent transaction has occurred, so that the user can block the card to prevent further financial loss to the user and to the credit card company.

    5. Valid transactions are treated as genuine transactions.

    Figure 1. System block diagram of credit card fraud detection.
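The flow of the steps above can be sketched in a few lines. The sketch is schematic: `run_fraud_module`, the `amount_rule` placeholder detector, and the sample transactions are all hypothetical names and values, and the placeholder rule merely stands in for the anomaly detection models.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Transaction = Dict[str, float]


def run_fraud_module(
    transactions: Iterable[Transaction],
    is_fraud: Callable[[Transaction], bool],
    alert: Callable[[Transaction], None],
) -> Tuple[List[Transaction], List[Transaction]]:
    """Route each transaction: raise an alert for fraud, pass valid ones through."""
    genuine, flagged = [], []
    for tx in transactions:
        if is_fraud(tx):       # step 3: the detector labels the transaction
            alert(tx)          # step 4: the alarm notifies the user
            flagged.append(tx)
        else:
            genuine.append(tx)  # step 5: treated as a genuine transaction
    return genuine, flagged


def amount_rule(tx: Transaction) -> bool:
    """Placeholder detector (an assumption); a trained model would go here."""
    return tx["amount"] > 1000.0


alerts: List[Transaction] = []
incoming = [
    {"id": 1, "amount": 25.0},
    {"id": 2, "amount": 4999.0},
    {"id": 3, "amount": 80.0},
]
genuine, flagged = run_fraud_module(incoming, amount_rule, alerts.append)
print("flagged:", [tx["id"] for tx in flagged])
```

Passing the detector in as a function keeps the routing logic independent of the model, so either algorithm can be plugged in without changing the alarm step.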


    We collected the dataset from Kaggle [1] and the source code from GitHub [4]. The dataset contains transactions made with credit cards in September 2013 by European cardholders, as shown in Figure 2.

    We imported the libraries, printed their versions, and then imported the necessary packages. We loaded the dataset from the CSV file using pandas and explored it. The dataset has 284,807 transactions and 31 columns, as shown in Figure 3. Columns V1 to V28 are the result of PCA dimensionality reduction, which protects sensitive information such as the identity and location of an individual. Class 0 indicates a valid transaction and class 1 indicates a fraudulent transaction. While exploring the dataset we noticed that the mean of the class label is close to 0, as shown in Figure 4, which means there are far more valid transactions than fraudulent ones in our dataset. To save time and computation, since it is a large dataset, we took only 10% of the data, which left 28,481 transactions.

    We plotted a histogram of each parameter to check for unusual parameters, as shown in Figure 5. We then calculated the number of fraud and valid cases, and the outlier fraction as the number of fraudulent transactions divided by the number of valid transactions, as shown in Figure 7. We constructed a correlation matrix with a heat map, shown in Figure 6, to find out whether there are strong linear relationships between variables in the dataset and which features are important for the overall classification. Most of the values were close to 0, so there were no strong relationships between the V parameters.

    We then formatted the dataset. We took all the columns from the data frame and filtered out the ones we do not want. X holds all columns except the class label, and Y is the target, a one-dimensional array holding the class label of each sample, as shown in Figure 8. This is unsupervised learning, as is usual for anomaly detection, so the labels are not fed into the models.
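The preparation steps described above can be sketched with pandas as follows. To keep the example self-contained it builds a small synthetic data frame instead of reading the Kaggle file; the column subset, the 2% synthetic fraud rate, and the file name `creditcard.csv` in the comment are assumptions, not details taken from our code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption for illustration).
# With the real file one would load it with: data = pd.read_csv("creditcard.csv")
rng = np.random.default_rng(1)
n_rows = 10_000
data = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["V1", "V2", "Amount"])
data["Class"] = (rng.random(n_rows) < 0.02).astype(int)  # 1 = fraud, 0 = valid

# Keep a 10% random sample to reduce computation.
sample = data.sample(frac=0.1, random_state=1)

# Count the cases and compute the outlier fraction (fraud / valid).
fraud = sample[sample["Class"] == 1]
valid = sample[sample["Class"] == 0]
outlier_fraction = len(fraud) / float(len(valid))

# Correlation matrix; a heat map of this matrix gives a view like Figure 6.
corr = sample.corr()

# X holds every column except the label; Y is the 1-D label array.
feature_columns = [c for c in sample.columns if c != "Class"]
X = sample[feature_columns]
Y = sample["Class"]

print(sample.shape, X.shape, Y.shape)
print("fraud:", len(fraud), "valid:", len(valid), "fraction:", outlier_fraction)
```

Fixing `random_state` makes the 10% sample reproducible, so the error counts reported later refer to the same rows on every run.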

    Figure 2. Contents of the dataset.

    Figure 3. The 31 columns of our dataset.

    Figure 4. Summary information, such as the mean and count, for our dataset.

    Figure 5. Histogram of each parameter.

    Figure 6. Correlation matrix with heat map.

    Figure 7. Number of valid and fraud cases and the outlier fraction.

    Figure 8. X and Y values.


Earlier, support vector machines (SVMs) were used for outlier detection, but they took more time on complex datasets. Isolation Forest and Local Outlier Factor are anomaly detection methods provided by the scikit-learn (sklearn) package. In the Local Outlier Factor method, the anomaly score of each sample is called its Local Outlier Factor. It measures the local deviation of the density of a given sample with respect to its neighbors, so the anomaly score depends on how isolated the object is with respect to its surrounding neighborhood. The Isolation Forest algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the highest and lowest values of that feature. The recursive partitioning can be represented by a tree structure, so the number of splits needed to isolate a sample equals the path length from the root to the terminating node. This path length, averaged over a forest of random trees, is a measure of normality and serves as the decision function. Random partitioning produces noticeably shorter paths for anomalies, so when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are very likely to be anomalies. The predicted values are -1 for outliers and 1 for inliers. This is useful information, but it has to be processed before it can be compared with the class label, which is 1 for a fraud event and 0 for a valid case. We take all inliers and relabel them as 0, indicating valid transactions, and we take all outliers and relabel them as 1, indicating fraudulent transactions. We then run the classification metrics, which report useful information such as the method name, number of errors, precision, recall, and F1-score.
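The following sketch shows how the two scikit-learn detectors can be fitted and how their predictions can be relabelled to match the class label. It runs on small synthetic data rather than the credit card dataset, and the data shape, the seeds, and `n_neighbors=20` are assumptions for illustration; the resulting error counts are therefore unrelated to those in the Results section.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumption): 980 clustered "valid" points, 20 scattered "fraud" points.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(980, 4)),
    rng.uniform(-8.0, 8.0, size=(20, 4)),
])
Y = np.array([0] * 980 + [1] * 20)
outlier_fraction = 20 / 980  # fraud cases divided by valid cases

detectors = {
    "Isolation Forest": IsolationForest(
        contamination=outlier_fraction, random_state=1
    ),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20, contamination=outlier_fraction
    ),
}

results = {}
for name, model in detectors.items():
    if name == "Local Outlier Factor":
        # In its default (non-novelty) mode LOF only offers fit_predict.
        raw = model.fit_predict(X)
    else:
        model.fit(X)
        raw = model.predict(X)

    # scikit-learn returns -1 for outliers and 1 for inliers;
    # relabel to 1 = fraud, 0 = valid so it can be compared with Y.
    y_pred = np.where(raw == -1, 1, 0)
    n_errors = int((y_pred != Y).sum())
    results[name] = (n_errors, accuracy_score(Y, y_pred), y_pred)

    print(f"{name}: {n_errors} errors")
    print(classification_report(Y, y_pred))
```

The labels in `Y` are used only after prediction, to count errors; neither model sees them during fitting.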

I. Results

For a complex dataset like ours, Isolation Forest is a good method, as it detects fraudulent transactions about 30% of the time. With the Local Outlier Factor method, we have 97 errors in total, which is relatively high, and an accuracy of 99.65942207%. Its precision and F1-score are not as good: for class 0 the precision is 100%, while for class 1 very few fraudulent transactions are identified correctly.

With the Isolation Forest method, we have 71 errors in total, which is relatively low, and an accuracy of 99.750711%. For class 1 the precision is 30%. The F1-scores are better for Isolation Forest than for the Local Outlier Factor method. Overall, Isolation Forest produced better results, as shown in Figure 9.

Figure 9. Method name, total number of errors, precision, recall, and F1-scores.
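The reported accuracies follow directly from the error counts, and a short calculation also shows why accuracy alone can be misleading on such skewed data. The sketch below assumes the 10% sample holds 28,481 rows (10% of 284,807, rounded); the confusion counts passed to `precision_recall_f1` are hypothetical and serve only to illustrate the formulas.

```python
# Assumption: the 10% sample of the 284,807-row dataset has 28,481 rows.
n_transactions = 28481
errors = {"Local Outlier Factor": 97, "Isolation Forest": 71}

# Accuracy is the share of transactions that were labelled correctly.
accuracy = {name: 1 - e / n_transactions for name, e in errors.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc * 100:.6f}%")

# A trivial model that labels every transaction as valid is wrong only on
# the fraud cases. With the rate quoted in the abstract (1 fraud in 1,200),
# it would already reach a very high accuracy while detecting no fraud.
baseline_accuracy = 1 - 1 / 1200
print(f"always-valid baseline: {baseline_accuracy * 100:.4f}%")


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 for the fraud class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 3 frauds caught, 7 false alarms, 7 frauds missed.
print(precision_recall_f1(tp=3, fp=7, fn=7))
```

Because the always-valid baseline already scores above 99.9%, precision, recall, and F1 on the fraud class are the more informative measures when comparing the two methods.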



We imported the CSV dataset, preprocessed it, and explored and described the data. We plotted histograms to check for unusual parameters and built a correlation matrix to find out which parameters are important for the class. We used two algorithms, Isolation Forest and Local Outlier Factor, to perform anomaly detection on the dataset, and we realized the importance of understanding the data and of precision. We observe that Isolation Forest performs better than Local Outlier Factor in terms of accuracy, number of errors, precision, F1 and recall scores. Fraud detection is a complex issue that requires substantial planning before machine learning algorithms are applied to it. Nonetheless, it is also an application of data science and machine learning for good, helping to ensure that the customer's money is safe and not easily tampered with. In future work, we can use neural networks to train the system for still higher accuracy and efficiency [5]. A dataset with non-anonymized features would make this particularly interesting, as outputting the feature importances would show which specific factors matter most for detecting fraudulent transactions.

Some of the advantages are:

  • Reduction in the number of fraudulent transactions.

  • Users can safely use their credit cards for online transactions.

  • An added layer of security.

Some drawbacks that can be further improved upon are:

  • Machine learning algorithms work well only with large sets of data. With smaller amounts of data the results may not be accurate, because models need a significant amount of data to become accurate. For large organizations this data volume is not an issue, but for others there must be enough data points to identify legitimate cause-and-effect relations.


  1. Datasets. (n.d.). Retrieved from

  2. A. Srivastava, M. Yadav, S. Basu, S. Salunkhe and M. Shabad, "Credit card fraud detection at merchant side using neural networks," 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, 2016, pp. 667-670.

  3. W. Yu and N. Wang, "Research on Credit Card Fraud Detection Model Based on Distance Sum," 2009 International Joint Conference on Artificial Intelligence, Hainan Island, 2009, pp. 353-356. doi: 10.1109/JCAI.2009.146

  4. Eduonix. (2018, July 26). Eduonix/creditcardML. Retrieved from

  5. tutorial/
