An Exploratory look at Data Extraction and Machine Learning for Detecting Fraudulent Financial Journal Entries

Abstract—The Accounting and Auditing professions are just beginning to recognize the value of Data Mining and Analytics for detection of outliers and 100% population testing. At present, the Financial Auditing profession has relied heavily on random or judgmental sampling to validate the completeness and accuracy of financial statements. This paper will examine a proposed method of examining 100% of the past fiscal year’s financial journal entries straight from the company in question’s Enterprise Resource Planning (ERP) system, applying a machine learning algorithm to the data, and outputting potentially anomalous journal entries to be examined manually in more depth.



Index Terms—Change detection algorithms, Classification algorithms, Detecting fraudulent financial journal entries
I. STATEMENT OF PROBLEM

The current financial auditing environment is unreliable in detecting corporate earnings manipulations, accounting errors, and fraud. The materiality-based sample testing approach used by the Big Four accounting firms is antiquated and has not kept pace with the changing dynamics of the business landscape. In particular, given advances in technology and the revolutionary work on the continuous auditing paradigm being done at Rutgers University under the direction of Miklos Vasarhelyi, PhD [1], the current method of sampling journal entries is not as effective as it could be. Our project employed three machine-learning-driven continuous auditing approaches that can be used by both internal and external auditors to identify anomalous journal entries as they occur, instead of waiting months after the fact for a sampling tie-out approach.
This method uses a scalable model that extracts data from the entity in question's production financial databases and applies a machine learning algorithm, either a classifier or a clustering algorithm, which can be run ad hoc or scheduled on any time frame. Our program is designed to require a low level of maintenance and is easily adjustable: only the schedule date and the SQL code need to change to adapt to different businesses and situations.
II. SOLUTION METHODOLOGY

Our team has developed a seven-step plan of action for addressing our research question.
1. Obtain and analyze financial accounting data to determine its suitability for use in testing.
2. Write SQL queries and Python code to import the chosen financial information into Python for research.
3. Create a training data set and validate its effectiveness.
4. Transfer and transform the raw financial data into scikit-learn [2] using the Python programming language.
5. Experiment with clustering and classifier algorithms to detect journal entry exceptions that are viable. We will test various hyperparameters and validate our model with cross-validation techniques.
6. Develop a Python application that imports the data, performs the chosen algorithm, produces a graphic, and exports the exception results to a network drive.
7. Adapt the model to work for both monthly and yearly time periods.
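The seven steps above can be sketched as a single driver function whose pieces are swapped out per engagement. All names here are illustrative placeholders, not the authors' actual code.

```python
# Skeleton of the application outlined in steps 6-7: import the data,
# run the chosen algorithm, and export the exceptions. Every name is a
# placeholder for illustration.

def run_audit(extract, detect, export, period_days=60):
    """Glue the pipeline together so that adapting to a new business or
    schedule only means swapping the extract query or the period."""
    entries = extract(period_days)      # step 2: SQL pull -> records
    exceptions = detect(entries)        # steps 4-5: model flags outliers
    return export(exceptions)           # step 6: write to network drive

# Step 7: monthly vs. yearly runs only change the window passed in.
demo = run_audit(
    extract=lambda days: list(range(days)),       # stand-in data pull
    detect=lambda xs: [x for x in xs if x > 25],  # stand-in model
    export=lambda xs: xs,                         # stand-in writer
    period_days=30,
)
```

The point of the indirection is the low-maintenance goal stated above: the scheduler and the SQL are the only moving parts.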
Our solution methodology focuses on understanding how to identify anomalous, potentially fraudulent journal entries and how classification and clustering machine learning algorithms work, and then combines the two to produce a powerful tool for the problem our research is trying to address.

III. STATEMENT OF PREVIOUS WORK RELATED TO THE PROBLEM
Compared to other research areas, there has not been a significant amount of academic research on using machine learning to test journal entries; however, a few studies have applied machine learning, although none take the exact approach of this research team [3], [4].
Since the major frauds at Enron and WorldCom, there has been a significant amount of authoritative guidance on the need for thorough journal entry fraud testing [5]. Implementation articles have been published on how to perform journal entry testing [6], and on how to perform testing using analytics and computer-assisted audit tools (CAATs) software such as CaseWare, Interactive Data Extraction and Analysis (IDEA), and Audit Command Language (ACL) [7].

IV. USING A CLUSTERING MACHINE LEARNING ALGORITHM TO ANALYZE FRAUDULENT FINANCIAL JOURNAL ENTRIES
The k-means clustering algorithm, an unsupervised machine-learning method, was chosen as one of the algorithms for our project.
The first step was to identify and obtain access to a financial ledger from a live ERP system. One of the team members, who works in Internal Audit, was able to obtain access through his company; with his manager's approval, he defined the contents of the relevant relations and wrote an SQL query to retrieve the data.

A. Fetching Data from SQL Database
Through the Python library pyodbc [8], the team was able to import the general ledger transactions from the Microsoft Dynamics AX ERP [9] system into Python for analysis via the ODBC (Open Database Connectivity) API. The SQL (Structured Query Language) query used for importing the financial ledger transaction data fetches each journal entry's account number, transaction date, posting codes, voucher, creator, and amount. In order to use the date as a numerical variable in the model, it was converted to an integer format. A "WHERE" clause was used to include only journal entries posted in the past 60 days, which allows for model repeatability without having to edit the source code.

B. Using K-means Clustering Algorithm to Analyze Data

K-means clustering is a method of grouping data into K predetermined groups by minimizing the sum of squared distances between each data point and its cluster centroid. The base code for this algorithm was structured off of a tutorial found in the Get Data Science course [10]. The non-numeric features used in our analysis, createdby, accountnum, and posting, were converted into a matrix so that they could be used in the scikit-learn k-means clustering algorithm; the DictVectorizer feature extractor from the sklearn library was used for this conversion [11]. The numeric features, amountcur, crediting, and date, were converted into a matrix using the hstack command from the scipy (sparse) library [12]. These two feature matrices were then combined and scaled using the sklearn scaling command. Trial-and-error experimentation was conducted to determine the most effective number of clusters and the number of times the model would run with different seeds; 25 clusters and 10 runs were found to be the most effective. The model was then fitted and used for prediction.
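The extraction and clustering steps described above can be sketched as follows. This is a minimal sketch, not the authors' code: the DSN and the table and column names in the query are illustrative guesses, not the actual Dynamics AX schema, and the demo DataFrame at the bottom is synthetic.

```python
# Sketch of the pyodbc extract + DictVectorizer/hstack/scale + k-means
# pipeline described in Sections IV-A and IV-B. Table and column names
# are hypothetical.
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical query in the spirit of the one the text describes: date
# cast to an integer, and a WHERE clause limiting the pull to the last
# 60 days so the job can be re-run without editing the source.
QUERY = """
SELECT ACCOUNTNUM AS accountnum,
       CAST(CONVERT(varchar(8), TRANSDATE, 112) AS int) AS date,
       POSTING    AS posting,
       VOUCHER    AS voucher,
       CREATEDBY  AS createdby,
       AMOUNTCUR  AS amountcur,
       CREDITING  AS crediting
FROM   GENERALJOURNALENTRY
WHERE  TRANSDATE >= DATEADD(day, -60, GETDATE())
"""

def fetch_journal_entries(conn_str):
    """Pull the last 60 days of journal entries over ODBC."""
    import pyodbc  # deferred so the clustering below runs without a DSN
    with pyodbc.connect(conn_str) as conn:
        return pd.read_sql(QUERY, conn)

def cluster_entries(df, n_clusters=25, n_init=10):
    """Vectorize categorical + numeric features, scale, run k-means."""
    cat = DictVectorizer().fit_transform(
        df[["createdby", "accountnum", "posting"]].to_dict("records"))
    num = csr_matrix(df[["amountcur", "crediting", "date"]].astype(float).values)
    X = StandardScaler(with_mean=False).fit_transform(hstack([cat, num]))
    return KMeans(n_clusters=n_clusters, n_init=n_init,
                  random_state=0).fit_predict(X)

# Tiny synthetic demo (real use: df = fetch_journal_entries(dsn)).
df = pd.DataFrame({
    "createdby":  ["alice", "bob", "alice", "itadmin"],
    "accountnum": ["1000", "2000", "1000", "9999"],
    "posting":    ["AP", "AR", "AP", "GL"],
    "amountcur":  [120.0, 95.5, 130.0, 1e6],
    "crediting":  [1, 0, 1, 0],
    "date":       [20230101, 20230102, 20230103, 20230104],
})
labels = cluster_entries(df, n_clusters=2, n_init=5)

# Flag entries in sparsely populated clusters (the text uses fewer than
# 20 tuples on production data) as candidates for manual review.
df["cluster"] = labels
suspects = df.groupby("cluster").filter(lambda g: len(g) < 20)
```

The small-cluster filter at the end corresponds to the extraction step described in Section C; on this four-row demo every cluster is of course below the threshold.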

C. Testing K-means Clustering Algorithm Output Data for Accuracy
The resulting clustered data was put into a pandas [13] dataframe and matched to the original data set, which included additional information such as voucher numbers, in order to interpret the results in a useful manner on a tuple-by-tuple basis. The data was then grouped by cluster and counted, and the clusters containing fewer than 20 tuples were extracted for analysis. The results were then exported to an Excel spreadsheet on a shared network drive for analysis. After model testing had been conducted on sample data sets, the full model was run on live production data, and multiple anomalous entries were identified that were deemed worthwhile to investigate. A couple of these anomalous entries were posted by IT staff, who are not normally authorized to post journal entries. Although a legitimate reason for their entries was found, these are exactly the sorts of entries auditors need to identify efficiently for further investigation.

V. USING CLASSIFICATION MACHINE LEARNING ALGORITHMS TO ANALYZE FRAUDULENT FINANCIAL JOURNAL ENTRIES

Two supervised classification algorithms were selected: random forests and support vector machines [15]. These algorithms were chosen due to the difficulty of the classification task: the classes for this task are extraordinarily unbalanced. Random forests have been shown to be effective at dealing with class imbalance, and support vector machines were chosen due to their ability to implement multiple kernels (both linear and non-linear).
Using the data source created by one of our team members, we built a dataset and split it into training and test validation sets for the classification task. The original ratio of the classes in this dataset is 99.97% non-fraudulent vs. 0.03% fraudulent journal entries (17 out of 58,843 journal entries).
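A sketch of this dataset preparation, including the per-account transaction-count feature discussed in the text, might look as follows. The column names and values are illustrative, not the authors' schema, and the toy data is far smaller than the 58,843-entry ledger.

```python
# Sketch: engineer a transaction count per account + credit/debit
# combination, then make a stratified train/test split so the tiny
# fraudulent class appears in both halves. Hypothetical column names.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "accountnum": ["1000", "1000", "1000", "2000", "2000", "3000"],
    "crediting":  [1, 1, 0, 0, 0, 1],
    "amountcur":  [50.0, 75.0, 20.0, 10.0, 15.0, 9e5],
    "fraud":      [0, 0, 0, 0, 1, 1],
})

# Number of entries sharing the same account + credit/debit combination.
df["txn_count"] = (
    df.groupby(["accountnum", "crediting"])["amountcur"].transform("count")
)

X = df[["amountcur", "crediting", "txn_count"]]
y = df["fraud"]

# stratify=y keeps the fraud ratio equal in train and test, which
# matters when positives are 0.03% of the population.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
```

Without stratification, a random split of so few positives can easily leave one side with no fraudulent examples at all.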
To prepare the data, we normalized it with scikit-learn [16] and removed the non-numeric features. For additional accuracy, we extracted one feature: the number of transactions for each account, grouped by credits and debits. This means our remaining features were the amount, whether the journal entry was a credit or a debit, and the transaction count for that account credit/debit combination. After experimenting with different feature options, we found that the transaction count became very important to prediction:

Figure 1 Importance of Features

Initially, with both classification models, accuracy was used to judge the success of our models. It was quickly recognized that this metric was insufficient to meet our needs, as the models showed the accuracy in Table 3:

Table 3 Classification Model Accuracy Table

From the accuracy metric, the models appeared to be performing very well. To confirm this, a confusion matrix was created for each model type (please note, there was no significant difference among the SVM kernels, so they have been combined into one confusion matrix); see Tables 4 and 5.

                         Predicted Non-Fraudulent   Predicted Fraudulent
Non-Fraudulent (Actual)  17638                      7
Fraudulent (Actual)      6                          2

Table 4 Confusion Matrix Table for Random Forest

As can be seen in the confusion matrices, the accuracy on the fraudulent entries was not sufficient to be of use in an enterprise setting: 11.76% for random forests and 0% for the SVM. After doing additional research [4], [5], [6], it was decided to attempt to rebalance the classes. It was theorized that balancing the incidence of each class would enable the predictor to more accurately determine which class each journal entry belonged to. We also discovered that it was possible to weight each class in the random forest models, and assigned a weight of 0.001 to the non-fraudulent class and 1000 to the fraudulent class. The dataset was resampled to a 50% occurrence rate for each class, which dramatically increased the accuracy for the random forest but lowered it for the SVM:

Table 6 Comparison of Classification Model Accuracy after Balancing

However, there was an increase for both models in terms of accurately predicting fraudulent entries:

Table 7 Fraudulent Entry Prediction Accuracy after Balancing

In conclusion, we found that the ability to weight the classes made a significant difference for the random forest algorithm, and rebalancing the classes increased the accuracy for the support vector machines. It is our recommendation that the random forest algorithm be used in an enterprise setting, as it had much improved accuracy over each kernel of the SVM implementation.
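The class-weighting fix described above can be sketched as follows. The 0.001/1000 weights come from the text; the data is synthetic and only illustrative, planting a handful of "fraudulent" rows in a sea of normal ones to mimic the imbalance.

```python
# Sketch of weighting the random forest's classes (0.001 for
# non-fraudulent vs. 1000 for fraudulent, as in the text) so the
# minority class is no longer drowned out. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y = np.zeros(n, dtype=int)

# Plant ten "fraudulent" entries with an extreme amount profile,
# mimicking 17-in-58,843-style imbalance at a smaller scale.
y[:10] = 1
X[:10, 0] += 8.0

clf = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 0.001, 1: 1000},  # weights used in the text
    random_state=0,
)
clf.fit(X, y)
caught = clf.predict(X[:10])  # the planted fraudulent rows
```

The alternative the text explores, resampling the data to a 50% occurrence rate per class before fitting, can be done with `sklearn.utils.resample` by upsampling the minority class; the weighting approach avoids duplicating rows.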

VI. RECOMMENDATION
The clustering and classification algorithms tested provided encouraging results toward implementing machine learning for journal entry fraud testing. Based strictly on the results obtained, we found the unsupervised k-means algorithm to be superior due to its ability to pick out anomalous entries without requiring significant computing power. The SVM algorithm showed potential, however, and we believe it should be explored more fully in future research.

VII. CONCLUSION
The use of machine learning for journal entry fraud testing is a scarcely explored area that produces high-quality results for anomalous entries. Downloading financial journal entries from an ERP system's underlying relational database is fairly straightforward and provides unlimited opportunities for further analysis. By utilizing the diverse and powerful Python libraries, we were able to create three separate machine-learning models that were run on journal entries downloaded via an ODBC connection from a live production database. The analysis produced high-quality results that warrant further research into applying machine learning, specifically k-means clustering, to journal entry fraud testing.