RF-XGBoost Model for Loan Application Scoring in Non-Banking Financial Institutions

Abstract: Non-banking financial institutions, known in short as NBFCs, follow a simple four-step process when handling a loan application: filling in the online application, uploading documents, credit analysis by the company, and disbursement. In this paper, we propose a hybrid model called Random Forest Extreme Gradient Boosting (RF-XGBoost). The model first calculates an application score and then predicts the status of the application, that is, whether the loan should be approved or rejected. A random forest classifier is used to obtain the importance of each feature. XGBoost and a simple neural net are then used to predict the loan status. Credit risk assessment across this lending space is based on Artificial Intelligence technology, which benefits both borrowers and platform lenders. Generating a unique score for each loan application helps the lender process the customer's loan more effectively. The experimental results show that the RF-XGBoost model performs best when compared with Logistic Regression, Random Forest, Linear SVM and Neural Net-3 Layers.


INTRODUCTION
Before applying for a loan, it is essential to understand the application process in order to get the loan approved. Loans provided by lending organizations serve many short-term and long-term needs. In addition, a good credit score increases the value of a loan application. Personal credit scoring plays a crucial role in the risk management of many commercial banks, and credit scoring techniques can reduce the risk these banks carry. With the growth of credit cards and personal loans, credit scoring technology has been widely applied in many lending areas and in the loan management of banks. A credit score makes loan approval processing much easier for the organization. In the literature, Orgler (1970) used a linear regression method to assess customers' loan risk [1]. Wiginton (1980) applied a Logistic Regression model to predict the credit score of a loan application [2]. Neural nets solve many problems because of their nonlinear capability [3][4].
This paper first calculates the application score of the customer from the important features selected by a random forest classifier. A boosting technique and neural nets are then applied to the data to predict the loan status of the customer.

Application Score
Data analysis is the key step in calculating the application score of the customer. Some of the important features are the customer's annual income, debts, employment length, and loan type. Each bank develops its own scoring model based on these characteristics, often adding further ones.

Feature importance
A random forest classifier is applied to the sample data to generate the importance of each feature. Figure 1 depicts some of the features marked as important in our data: employment length, application score, annual income, debts, loan type, and interest rate.
The target variable in the data is loan_status, which indicates whether the loan is approved (1) or rejected (0). The data is shuffled randomly, and a train-test split in the ratio 80:20 is then applied: the model is built on 80% of the data and tested on the remaining 20%. Several new features are generated from the provided data: debt-to-income ratio, loan installment amount, monthly savings and, finally, the application score of the candidate.
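As an illustration of how such a ranking can be obtained, the following is a minimal sketch using scikit-learn's `RandomForestClassifier`. The data is synthetic, and the column names, the label rule and the hyperparameters are assumptions made for the example; they are not taken from the paper's data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: the column names echo features named in the text,
# but the values and the label rule are invented for illustration only.
rng = np.random.default_rng(0)
n = 500
feature_names = ["employment_length", "annual_income", "debts", "interest_rate", "noise"]
X = np.column_stack([
    rng.integers(0, 30, n),          # employment length in years
    rng.normal(60_000, 15_000, n),   # annual income
    rng.normal(20_000, 8_000, n),    # outstanding debts
    rng.uniform(5, 25, n),           # interest rate in percent
    rng.normal(0, 1, n),             # pure-noise column, should rank low
])
# Hypothetical label: approve (1) when income comfortably exceeds debts.
y = (X[:, 1] - 2 * X[:, 2] > 15_000).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# feature_importances_ is the impurity-based importance, normalised to sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
ranking = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, value in ranking:
    print(f"{name:18s} {value:.3f}")
```

In this toy setup the label depends only on income and debts, so those two columns should dominate the ranking while the noise column stays near the bottom.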
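The derived features and the 80:20 split described above can be sketched in plain Python. The paper does not state the formulas it uses, so everything here is an assumption: the installment uses the standard amortised-loan formula, the debt-to-income ratio and monthly savings use common textbook definitions, and the field names (`loan_amount`, `term_months`, `monthly_debt`) are hypothetical. The application score is left out because its definition is not given.

```python
import random

def monthly_installment(principal, annual_rate_pct, term_months):
    """Standard amortised installment; an assumed formula, not the paper's."""
    r = annual_rate_pct / 100 / 12            # monthly interest rate
    if r == 0:
        return principal / term_months
    growth = (1 + r) ** term_months
    return principal * r * growth / (growth - 1)

def add_derived_features(row):
    """Return a copy of `row` (a dict) with three derived fields added."""
    monthly_income = row["annual_income"] / 12
    installment = monthly_installment(
        row["loan_amount"], row["interest_rate"], row["term_months"]
    )
    out = dict(row)
    out["installment"] = installment
    out["debt_to_income"] = row["monthly_debt"] / monthly_income
    out["monthly_savings"] = monthly_income - row["monthly_debt"] - installment
    return out

def shuffle_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows, then cut them into train and test parts (80:20 by default)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]

# Hypothetical applicants, for illustration only.
applicants = [
    {"annual_income": 60_000 + 1_000 * i, "monthly_debt": 500.0,
     "loan_amount": 100_000, "interest_rate": 12.0, "term_months": 12,
     "loan_status": i % 2}
    for i in range(10)
]
enriched = [add_derived_features(a) for a in applicants]
train_rows, test_rows = shuffle_split(enriched)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the shuffle reproducible, which matters when the accuracy of several models is compared on the same split.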

Vol. 9 Issue 07, July-2020

Workflow
The algorithm of the proposed RF-XGBoost model is as follows.
Step 1: Load the sample data in a Python environment.
Step 2: Perform Exploratory Data Analysis on the data, which includes dropping columns with more than 90% null values, table columns, and imputing the columns that have missing values.
Step 3: Compute the correlation between the features and identify the dependent features among them.
Step 4: Apply a random forest classifier to extract the importance of each feature.
Step 5: Split the data into training and test sets in the ratio 80:20. Fit the model on the training set.
Step 6: Apply XGBoost and the neural net with 2 hidden layers to the training and test sets and collect the results.
Step 7: Compare the accuracy of the two models and select the better one.
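Steps 2 and 3 above could look roughly like the following pandas sketch. The 90% null threshold comes from the text; the imputation strategy (median for numeric columns, most frequent value otherwise), the correlation threshold of 0.8, the function name and the toy column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def clean_and_find_correlated(df, null_threshold=0.9, corr_threshold=0.8):
    # Step 2a: drop columns in which more than `null_threshold` of values are null.
    keep = df.columns[df.isna().mean() <= null_threshold]
    df = df[keep].copy()

    # Step 2b: impute remaining gaps (median for numeric, mode for other columns).
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Step 3: list numeric feature pairs whose absolute Pearson correlation is high.
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = [
        (cols[i], cols[j], float(corr.iloc[i, j]))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > corr_threshold
    ]
    return df, pairs

# Toy frame with one almost-empty column and one redundant column.
rng = np.random.default_rng(0)
income = rng.normal(60_000, 15_000, 20)
income[3] = np.nan
raw = pd.DataFrame({
    "annual_income": income,
    "monthly_income": income / 12,                 # redundant by construction
    "debts": rng.normal(20_000, 8_000, 20),
    "loan_type": ["personal", "auto"] * 9 + [None, "auto"],
    "mostly_empty": [np.nan] * 19 + [1.0],         # 95% null, gets dropped
})
clean, pairs = clean_and_find_correlated(raw)
print(clean.columns.tolist())
print(pairs)
```

Listing the correlated pairs rather than dropping one column automatically leaves the choice of which feature to keep to the analyst.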

Extreme Gradient Boosting (XGBoost)
The strength of XGBoost lies in its scalability and reliability. It also uses memory efficiently. XGBoost is an ensemble model: ensemble models offer a systematic way to combine the predictive power of multiple machine learners into a single final result. The XGBoost library implements gradient-boosted decision trees.
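A minimal sketch of fitting a boosted-tree classifier on an 80:20 split is shown below. It uses `XGBClassifier` when the xgboost package is installed and otherwise falls back to scikit-learn's `GradientBoostingClassifier`, which is a different implementation of gradient boosting and serves only to keep the example runnable. The synthetic data, the label rule and the hyperparameters are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier as Booster   # the library named in the text
except ImportError:
    # Fallback: scikit-learn's gradient boosting, not XGBoost itself.
    from sklearn.ensemble import GradientBoostingClassifier as Booster

# Synthetic data with a hypothetical nonlinear approval rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))
y = ((X[:, 0] * X[:, 1] + X[:, 2]) > 0).astype(int)

# Shuffle and split 80:20, as in the workflow.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1
)

# Each new tree is fitted to correct the errors of the trees built so far.
model = Booster(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Shallow trees with a modest learning rate are a common starting point; the values would normally be tuned on a validation set.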

Neural Net with 2 hidden layers
In this model, the input layer takes 16 features, followed by two hidden layers and an output layer. The neural net is trained for around 100 epochs with a batch size of 100. The 'adam' optimiser is used together with the 'binary_crossentropy' loss function. The activation function in the output layer is 'sigmoid', whereas the input and hidden layers use 'ReLU' (Rectified Linear Unit).
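The configuration above can be approximated with scikit-learn's `MLPClassifier`, used here as a stand-in because the paper does not name its framework. For binary targets this class applies a logistic (sigmoid) output and minimises log-loss, which corresponds to binary cross-entropy. The hidden-layer widths (32 and 16), the synthetic data and the label rule are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data with 16 input features, matching the count given in the text.
rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 16))
w = rng.normal(size=16)
y = (X @ w + 0.5 * X[:, 0] * X[:, 1] > 0).astype(int)   # hypothetical label rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
scaler = StandardScaler().fit(X_train)   # fit scaling on the training set only

net = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # two hidden layers; the widths are assumptions
    activation="relu",            # ReLU in the hidden layers
    solver="adam",
    batch_size=100,
    max_iter=100,                 # with adam, max_iter is the number of epochs
    random_state=2,
)
net.fit(scaler.transform(X_train), y_train)

accuracy = net.score(scaler.transform(X_test), y_test)
print(f"layers: {net.n_layers_}, output activation: {net.out_activation_}")
print(f"test accuracy: {accuracy:.3f}")
```

scikit-learn may warn that training stopped before convergence at 100 epochs; that matches the fixed epoch budget described in the text and does not prevent the model from being evaluated.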