 Open Access
 Authors : Amruta Aher , Rajeswari Kannan , Sushma Vispute
 Paper ID : IJERTV10IS070271
 Volume & Issue : Volume 10, Issue 07 (July 2021)
 Published (First Online): 07-08-2021
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Data Analysis and Price Prediction of Black Friday Sales using Machine Learning Techniques
Amruta Aher Department of Computer Engineering
Pimpri Chinchwad College of Engineering, Akurdi
Pune, India
Dr. K. Rajeswari Department of Computer Engineering
Pimpri Chinchwad College of Engineering, Akurdi
Pune, India
Prof. Sushma Vispute Department of Computer Engineering
Pimpri Chinchwad College of Engineering, Akurdi
Pune, India
Abstract – Black Friday marks the beginning of the Christmas shopping season across the US. On Black Friday, big shopping giants like Amazon, Flipkart, etc. lure customers by offering discounts and deals on different product categories. The product categories range from electronic items and clothing to kitchen appliances and décor. Various researchers have carried out research to predict sales. The analysis of this data serves as a basis to provide discounts on various product items. With the purpose of analyzing and predicting the sales, we have used five models. The Black Friday Sales Dataset available on Kaggle has been used for analysis and prediction purposes. The models used for prediction are Linear Regression, Lasso Regression, Ridge Regression, Decision Tree Regressor, and Random Forest Regressor. Mean Squared Error (MSE) is used as the performance evaluation measure. Random Forest Regressor outperforms the other models with the lowest MSE score.
Keywords – Regression, Linear Regression, Ridge Regression, Lasso Regression, Decision Tree Regressor, Random Forest Regressor, Mean Squared Error, Data Analysis

INTRODUCTION
The shopping sector has greatly evolved due to the Internet revolution. Most of the population now considers online shopping ahead of the traditional method of shopping. The biggest perks of online shopping are convenience, better prices, more variety, easy price comparisons, no crowds, etc. The pandemic has boosted online shopping further. Online shopping keeps growing every year, and the total sales for the year 2021 are expected to be much higher [16].
Black Friday originated in the USA and falls on the day after Thanksgiving Day, which is celebrated on the fourth Thursday of November every year. This day is marked as the busiest day in terms of shopping. The purpose of organizing this sale is to encourage customers to buy more products online and so boost the online shopping sector.
The prediction model built will help to analyze the relationship among various attributes. The Black Friday Sales Dataset is used for training and prediction. It is among the largest datasets available online and is also accepted by various e-commerce websites [1].
The prediction model built will provide a prediction based on the age of the customer, city category, occupation, etc. The prediction model is implemented using Linear Regression, Ridge Regression, Lasso Regression, Decision Tree Regressor, and Random Forest Regressor.
The paper further walks through various sections. Section I gives an introduction to the problem, section II illustrates the prior research done in this field, section III provides the data set description, section IV presents the proposed model, with the conclusion in the last section.

LITERATURE REVIEW
Ample research is carried out on the analysis and prediction of sales using various techniques. There are many methods proposed to do so by various researchers. In this section, we will summarize a few of the machine learning approaches.
C. M. Wu et al. [1] have proposed a prediction model to analyze the customer's past spending and predict the future spending of the customer. The dataset referred to is the Black Friday Sales Dataset from analyticsvidhya. They have used machine learning models such as Linear Regression, MLK classifier, a deep learning model using Keras, Decision Tree, Decision Tree with bagging, and XGBoost. Root Mean Squared Error (RMSE) is the performance measure used to evaluate the models. They conclude that simple problems like regression can be solved with simple models like linear regression instead of complex neural network models.
Odegua, Rising [2] have proposed a sales forecasting model. The machine learning models used for implementation are K-Nearest Neighbor, Random Forest, and Gradient Boosting. The dataset used for the experimentation is provided by Data Science Nigeria, as a part of competitions based on Machine Learning. The performance evaluation measure used is Mean Absolute Error (MAE). Random Forest outperformed the other algorithms with an MAE of 0.409178.
Singh, K. et al. [3] have analyzed and visually represented the sales data provided in a complex dataset, giving ample clarity about how the data behaves. This helps the investors and owners of an organization to analyze and visualize the sales data, leading to proper decisions and revenue generation. The data visualization is based on different parameters and dimensions. The results enable the end user to make better decisions, predict future sales, align production with demand, and calculate regional sales.
S. Yadav et al. [4] have analyzed and compared the performance of K-Fold cross-validation and the hold-out validation method. Their experiments show that K-Fold cross-validation gives more accurate results: its accuracy was around 0.1% to 3% higher than that of hold-out validation for the same set of algorithms.
Purvika Bajaj et al. [5] have performed sales prediction based on a dataset collected from a grocery store. The algorithms used for experimentation are Linear Regression, the K-Nearest Neighbors algorithm, XGBoost, and Random Forest. The results are assessed using Root Mean Squared Error (RMSE), Variance Score, and Training and Testing Accuracies. The Random Forest algorithm outperforms the other three algorithms with an accuracy of 93.53%.
Ramasubbareddy S. et al. [6] have applied machine learning algorithms to predict sales. The dataset for the experimentation is the Black Friday Sales Dataset taken from Kaggle. The algorithms used for the implementation of the system are Linear Regression, Ridge Regression, XGBoost, Decision Tree, Random Forest, and Rule-Based Decision Tree. Root Mean Squared Error (RMSE) is used as the performance evaluation measure: the lower the RMSE value, the better the prediction. Based on RMSE, the Rule-Based DT outperforms the other machine learning techniques with an RMSE of 2291.
Aaditi Narkhede et al. [7] have applied machine learning algorithms to track sales at places like shopping centers and Big Mart, in order to anticipate customer demand and manage inventory accordingly. The methods presented are effective for data shaping and decision making, and offer new ways to better identify consumer needs and formulate marketing plans that will improve sales.
M. Sahaya Vennila et al. [8] have analyzed, preprocessed, and applied machine learning techniques to predict sales. The dataset used for the analysis and experimentation is the Black Friday Sales Dataset from Kaggle. The dataset is preprocessed, and the K-Fold method is used to split it into training and testing datasets. The prediction model is implemented using Linear Regression, Decision Tree, Random Forest, Gradient Boost, and XGBoost. Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are used as the accuracy evaluation measures. As a result of experimentation, the Random Forest performed significantly well, with an accuracy of 77%, an RMSE value of 2730, and an MAE value of 2349.

IMPLEMENTATION
The study uses the Black Friday Sales Dataset [9], publicly available on Kaggle. The dataset consists of sales transaction data in 5,50,069 rows.
The dataset consists of attributes such as user_id, product_id, marital_status, city_category, occupation, etc. The dataset definition is given in Table I.
The Black Friday Sales Dataset is used for training various machine learning models and also for predicting the purchase amount of customers during Black Friday sales [1]. The purchase prediction will give retailers insight with which to analyze and personalize offers on the products customers prefer.
TABLE I. DATASET DEFINITION
| SR NO | VARIABLE | DEFINITION | MASKED |
|---|---|---|---|
| 1 | USER_ID | Unique ID of customer | False |
| 2 | PRODUCT_ID | Unique product ID | False |
| 3 | GENDER | Sex of customer | False |
| 4 | AGE | Customer age | False |
| 5 | OCCUPATION | Occupation of customer | True |
| 6 | CITY_CATEGORY | City category of customer | True |
| 7 | STAY_IN_CURRENT_CITY | Number of years customer stays in city | False |
| 8 | MARITAL_STATUS | Customer marital status | False |
| 9 | PRODUCT_CATEGORY_1 | Product category | True |
| 10 | PRODUCT_CATEGORY_2 | Product category | True |
| 11 | PRODUCT_CATEGORY_3 | Product category | True |
| 12 | PURCHASE | Amount of customer purchase | False |
The Purchase variable is the target variable. The model predicts its value, the amount of purchase made by a customer on the occasion of Black Friday sales, from the remaining attributes.
As mentioned in the introduction, the proposed approach implements machine learning models such as Linear Regression, Ridge Regression, Lasso Regression, Decision Tree Regressor, and Random Forest Regressor to forecast sales. Figure 1 depicts the flow of data through the proposed model.
Exploratory Data Analysis has been performed on the dataset [5]. The tools used for the data analysis are Python, pandas, Matplotlib, NumPy, array, seaborn, and Jupyter Notebook.
Fig. 1. Flowchart of Proposed System
The Black Friday Sales Dataset is the input dataset. Data visualization of the various attributes of this dataset is performed.
Data preprocessing, which mainly consists of filling missing values, is performed. The categorical values are label encoded to numeric form. For example, Gender, where F represents female and M represents male, is converted to the numerical values 0 and 1. Other categorical attributes such as City_Category, Stay_In_Current_City, and Age are converted to numerical form in the same way by applying Label Encoding.
The attributes such as User_id and Product_id are removed to train the model with no bias based on user_id or product_id and to achieve better performance.
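The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the tiny data frame is invented, filling missing values with 0 is an assumption (the paper does not state the fill value), and the label encoding is done with pandas category codes rather than any specific encoder named in the paper.

```python
import pandas as pd

# Tiny stand-in frame using column names from the dataset description.
# All values are invented for illustration; they are not taken from the Kaggle file.
df = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003, 1000004],
    "Product_ID": ["P001", "P002", "P003", "P004"],
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "City_Category": ["A", "C", "B", "A"],
    "Stay_In_Current_City_Years": ["2", "4+", "1", "0"],
    "Product_Category_2": [None, 6.0, 14.0, None],
    "Purchase": [8370, 15200, 1422, 7969],
})


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Fill missing product categories with 0 (an assumed placeholder value).
    out["Product_Category_2"] = out["Product_Category_2"].fillna(0).astype(int)
    # Label-encode each categorical column: sorted unique values map to 0, 1, 2, ...
    for col in ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]:
        out[col] = out[col].astype("category").cat.codes
    # Drop the identifier columns so the model cannot key on them.
    return out.drop(columns=["User_ID", "Product_ID"])


clean = preprocess(df)
print(clean)
```

With this encoding, F maps to 0 and M to 1, matching the convention described for the Gender attribute.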
The algorithms used for implementing the system are Linear Regression, Ridge Regression [19], Lasso Regression [19], Decision Tree Regressor, and Random Forest Regressor. The models are trained using 5-fold cross-validation [4][12]. The performance evaluation measure used is Mean Squared Error (MSE).
Random Forest Regressor performs better than the other algorithms with an MSE score of 3062.719.
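As an illustration of this evaluation setup, the sketch below scores the five model families with 5-fold cross-validation and MSE. It assumes scikit-learn, which the paper does not name explicitly, uses synthetic data in place of the encoded dataset, and picks hyperparameters (alpha, n_estimators) arbitrarily, so the printed scores bear no relation to the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the encoded features and Purchase.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Hyperparameter values here are arbitrary choices for the sketch.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mse_by_model = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that larger is better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    mse_by_model[name] = -scores.mean()

for name, mse in mse_by_model.items():
    print(f"{name}: {mse:.3f}")
```

Each model is fitted on four folds and scored on the held-out fifth, and the five fold scores are averaged, so every row contributes to validation exactly once.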

DATASET VISUALIZATION
A heatmap is used for determining the correlation between dataset attributes. It represents the data of a given dataset graphically, using a color system to show the correlation among different attributes. It is provided by the Seaborn data visualization library.
In the color-encoded heatmap matrix, the lower the intensity of the color of an attribute related to the target variable, the higher the dependency between the target and that attribute.
Based on the Black Friday Sales Dataset [9], the heatmap obtained is shown in Figure 2. The observation based on the heatmap is that the attributes age and marital_status are correlated, as are product_category_3 and purchase.
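The matrix behind such a heatmap can be computed as in the following sketch. The data are randomly generated stand-ins (with marital status deliberately tied to age so that a correlation appears), and the use of Pearson correlation via pandas is an assumption, since the paper does not say which coefficient was plotted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
age = rng.integers(0, 7, size=n)              # label-encoded age group (invented)
# Marital status loosely tied to age so the two columns correlate.
marital = (age + rng.integers(0, 3, size=n) > 4).astype(int)
occupation = rng.integers(0, 21, size=n)      # masked occupation code (invented)
purchase = rng.normal(9000, 5000, size=n)     # purchase amount (invented)

df = pd.DataFrame({
    "Age": age,
    "Marital_Status": marital,
    "Occupation": occupation,
    "Purchase": purchase,
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))

# With seaborn installed, the matrix can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```

How color intensity maps to correlation strength depends on the colormap chosen when the heatmap is drawn.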
Fig. 2. Heatmap for correlation between attributes
The count plots for different attributes are shown in the figures below. The count plot for the gender attribute is shown in Figure 3. It shows that the value M (Male) has the maximum count, while the count for F (Female) is lower.
Fig. 3. Count Plot for Gender.
The count plot for the age attribute is shown in Figure 4. The age group 26-35 has the maximum count. The second maximum count is for the age group 36-45, and the third maximum count is for the age group 18-25.
Fig. 4. Count Plot for Age.
The count plot for the occupation attribute is shown in Figure 5. The observation based on the count plot is that the masked occupation 4 has the maximum count. The second maximum is occupation 0.
Fig. 5. Count Plot for Occupation
The count plot for city_category is given in Figure 6. The count plot shows the maximum count for category B. The second maximum count is for category C. The minimum count is for category A.
Fig. 6. Count Plot for City_Category
The count plot for Stay_In_Current_City is given in Figure 7. The maximum count is for 1 year. The minimum count is for 0 years.
Fig. 7. Count Plot for Stay_In_Current_City_Years
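The tallies that a count plot displays can be reproduced with a few lines of pandas. The frame below is a made-up six-row sample, so the counts only illustrate the mechanics and do not reflect the dataset.

```python
import pandas as pd

# Invented sample rows; the real dataset has far more.
df = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F", "M"],
    "City_Category": ["B", "C", "B", "A", "B", "C"],
})

# value_counts() returns the per-category tallies, sorted from most to least frequent.
gender_counts = df["Gender"].value_counts()
city_counts = df["City_Category"].value_counts()
print(gender_counts)
print(city_counts)

# With seaborn installed, the same tallies can be drawn as bars:
# import seaborn as sns
# sns.countplot(data=df, x="Gender")
```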


RESULTS & DISCUSSION

Linear Regression.
Linear Regression is one of the supervised machine learning algorithms [17]. A regression problem can be stated as a case when the output variable is continuous [10]. Linear regression predicts a dependent variable (y) based on a given independent variable (x). The model depicts a linear relation among the variables. Function for linear regression is:
$$y = \theta_1 + \theta_2 \cdot x$$
Here, the input variable is $x$, the output value is $y$, $\theta_1$ represents the intercept, and $\theta_2$ represents the coefficient of $x$ [13]. The algorithm aims to find the best-fit line relating the independent variable to the target variable.
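As a small worked example, the intercept and coefficient of a best-fit line can be recovered by ordinary least squares. The sketch uses NumPy directly and a noise-free toy line chosen for illustration; it is not the fitting routine used in the paper.

```python
import numpy as np

# Toy data lying exactly on a line with intercept 2 and slope 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones so the intercept is estimated too.
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution minimizes the sum of squared residuals.
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coeffs
print(intercept, slope)
```

Because the toy data contain no noise, the fit recovers the generating intercept and slope up to floating-point error.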
The features which majorly affect the linear regression model are depicted in Figure 8.
Fig. 8. Attributes affecting Linear Regression

Ridge Regression.
Multiple regression data can be analyzed using Ridge Regression [17][18][19]. When multicollinearity occurs, least-squares estimates remain unbiased, but their variances are large, so they may be far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. The formula for the ridge regression is [18]:
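The formula is not reproduced in this copy. As a stand-in, the ridge objective in its standard textbook form is sketched below. The symbols ($\beta_0$ for the intercept, $\beta_j$ for the coefficients, $\lambda \ge 0$ for the penalty strength, $n$ samples and $p$ features) are notation assumed here and may differ from the form given in [18].

```latex
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}
    \left\{
      \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
      + \lambda \sum_{j=1}^{p} \beta_j^{2}
    \right\}
```

The first term is the usual least-squares loss and the second is the squared-coefficient penalty that introduces the bias; setting $\lambda = 0$ recovers ordinary least squares.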
The attributes affecting the ridge model for the given dataset are depicted in Figure 9.
Fig. 9. Attribute affecting Ridge Regression

Lasso Regression.
Lasso Regression provides both variable selection and regularization [17]. It makes use of soft thresholding. In Lasso regression, only a subset of the provided covariates is selected for use in the final model. It can be denoted as [18]:
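The expression is missing from this copy. A standard statement of the lasso objective, together with the soft-thresholding operator mentioned above, is given here as a hedged stand-in. The notation ($\beta_0$, $\beta_j$, $\lambda \ge 0$, $n$, $p$, and the operator name $S_\lambda$) is assumed for this sketch and is not taken from [18].

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}
    \left\{
      \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
      + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
    \right\},
\qquad
S_{\lambda}(z) = \operatorname{sign}(z)\,\max\bigl(\lvert z \rvert - \lambda,\ 0\bigr)
```

The absolute-value penalty is what allows coefficients to be driven exactly to zero, since $S_\lambda(z) = 0$ whenever $\lvert z \rvert \le \lambda$; this is the mechanism behind the variable selection.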
The attributes affecting lasso regression for the given dataset are as shown in Figure 10.
Fig. 10. Attribute affecting Lasso Regression
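The practical difference between the two penalties can be seen in a small experiment. The sketch below assumes scikit-learn and synthetic data in which only two of six features matter; the alpha values are arbitrary, and nothing here reproduces the coefficients shown in the figures.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Only the first two features influence the target; the other four are pure noise.
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks coefficients toward zero; lasso can set them exactly to zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
print("ridge coefficients:", ridge.coef_.round(3))
print("lasso coefficients:", lasso.coef_.round(3))
print("exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```

Counting exact zeros makes the variable-selection behavior of lasso visible, whereas ridge keeps every feature with a small nonzero weight.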

Decision Tree Regressor.
The Decision Tree (DT) model builds a tree-like structure for regression or classification models [15]. The dataset is broken down into progressively smaller subsets. In a DT, control statements on attribute values are the basis for branching: each splitting node sends data points to one side or the other depending on the value of a specific attribute. The attribute selection measure plays an important role in root node selection.

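Information Gain: The splitting attribute is chosen based on the amount of information required to describe the tree. Following the definitions in [17], for a dataset $D$ with $m$ classes split by attribute $A$ into partitions $D_1, \dots, D_v$: $$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j), \qquad Gain(A) = Info(D) - Info_A(D)$$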

Gain Ratio: Information gain is biased toward attributes that have a large number of values. The gain ratio normalizes the information gain to correct for this bias [17].

Gini Index: The formula for calculating the Gini index is as given below [17]: $$Gini(D) = 1 - \sum_{i=1}^{m} p_i^{2}$$
where $p_i$ is the probability that a tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$. The sum is computed over $m$ classes.
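The attribute selection measures above can be made concrete with a short sketch. The helper names and the four-label toy example are hypothetical, chosen only to show the arithmetic; the functions follow the standard entropy-based information gain and Gini definitions for categorical class labels.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class among the labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the classes present in D.
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    # Gini(D) = 1 - sum(p_i ** 2) over the classes present in D.
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def information_gain(labels, partitions):
    # Gain = Info(D) - sum(|D_j| / |D| * Info(D_j)) over the partitions D_j.
    total = len(labels)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(labels) - weighted


# Toy node with two classes, and a split that separates them perfectly.
labels = ["high", "high", "low", "low"]
perfect_split = [["high", "high"], ["low", "low"]]

print(entropy(labels), gini(labels), information_gain(labels, perfect_split))
```

A split that leaves every partition pure removes all uncertainty, so its information gain equals the entropy of the parent node.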
The attributes that majorly affect the decision tree models are shown in Figure 11.
Fig. 11. Attribute affecting Decision Tree Regressor


Random Forest Regressor
Random Forest, being an ensemble technique, is capable of both regression and classification. It is based on the idea of combining the predictions of multiple DTs rather than depending on a single DT [17].
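The idea of combining several trees can be sketched by hand: fit each tree on a bootstrap resample and average the predictions. This is a simplified illustration assuming scikit-learn's DecisionTreeRegressor and synthetic one-feature data; it shows only the bagging part and omits the per-split feature subsampling that a full random forest also performs.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic one-feature regression problem: noisy sine curve.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

trees = []
for seed in range(25):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=seed)
    trees.append(tree.fit(X[idx], y[idx]))

X_new = np.array([[0.0], [1.5]])
# The ensemble prediction is the average of the individual tree predictions.
forest_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
single_pred = trees[0].predict(X_new)
print("single tree:", single_pred)
print("averaged   :", forest_pred)
```

Averaging smooths out the noise that any single fully grown tree memorizes, which is the usual motivation for preferring the ensemble.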
The attributes that affect the Random Forest Regressor are as given in Figure 12.
Fig. 12. Attribute affecting Random Forest Regressor

Comparative Analysis
The comparison between the MSE scores of all algorithms is given in Table II.
TABLE II. COMPARATIVE ANALYSIS
| Model | MSE |
|---|---|
| Linear Regression | 4617.99 |
| Ridge Regression | 4687.75 |
| Lasso Regression | 4694.14 |
| Decision Tree Regressor | 3363.87 |
| Random Forest Regressor | 3062.72 |
Based on Table II, it can be observed that Random Forest Regressor gives better performance than the other machine learning models, namely Linear Regression, Ridge Regression, Lasso Regression, and Decision Tree Regressor.
The MSE score of Random Forest Regressor is 3062.72, the lowest among the models, and hence it is the most suitable for the prediction model to be implemented.


CONCLUSION
With traditional methods not being of much help to business growth in terms of revenue, machine learning approaches prove to be important for shaping a business plan that takes the shopping patterns of consumers into consideration.
Projection of sales based on several factors, including the sales of the previous year, helps businesses adopt suitable strategies for increasing the sales of goods that are in demand.
Thus, the Black Friday Sales Dataset from Kaggle [9] is used for the experimentation. The models used are Linear Regression, Lasso Regression, Ridge Regression, Decision Tree Regressor, and Random Forest Regressor. The evaluation measure used is Mean Squared Error (MSE). Based on Table II, Random Forest Regressor is the most suitable for the prediction of sales on the given dataset.
Thus the proposed model will predict customer purchases on Black Friday and give the retailer insight into customer choice of products. This will enable discounts based on customer-centric choices, benefiting both the retailer and the customer.

FUTURE WORK
As future research, we can perform hyperparameter tuning and apply different machine learning algorithms.
REFERENCES

[1] C. M. Wu, P. Patil and S. Gunaseelan, "Comparison of Different Machine Learning Algorithms for Multiple Regression on Black Friday Sales Data," 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), 2018, pp. 16-20, doi: 10.1109/ICSESS.2018.8663760.

[2] Odegua, Rising. (2020). Applied Machine Learning for Supermarket Sales Prediction.

[3] K. Singh and R. Wajgi, "Data analysis and visualization of sales data," 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 2016, pp. 1-6, doi: 10.1109/STARTUP.2016.7583967.

[4] S. Yadav and S. Shukla, "Analysis of k-Fold Cross-Validation over Hold-Out Validation on Colossal Datasets for Quality Classification," 2016 IEEE 6th International Conference on Advanced Computing (IACC), 2016, pp. 78-83, doi: 10.1109/IACC.2016.25.

[5] Purvika Bajaj, Renesa Ray, Shivani Shedge, Shravani Vidhate, Prof. Dr. Nikhilkumar Shardoor, "Sales Prediction Using Machine Learning Algorithms," International Research Journal of Engineering and Technology (IRJET), Vol. 7, Issue 6, 2020, e-ISSN: 2395-0056, p-ISSN: 2395-0072.

[6] Ramasubbareddy S., Srinivas T.A.S., Govinda K., Swetha E. (2021) Sales Analysis on Back Friday Using Machine Learning Techniques. In: Satapathy S., Bhateja V., Janakiramaiah B, Chen YW. (eds) Intelligent System Design. Advances in Intelligent Systems and Computing, vol 1171. Springer, Singapore. https://doi.org/10.1007/978 9811554001_32

[7] Aaditi Narkhede, Mitali Awari, Suvarna Gawali, Prof. Amrapal Mhaisgawali, "Big Mart Sales Prediction Using Machine Learning Techniques," International Journal of Scientific Research and Engineering Development (IJSRED), Vol. 3, Issue 4, pp. 693-697.

[8] M. Sahaya Vennila; Holy Cross College, Nagercoil. Affiliated to Manonmaniam Sundaranar University, Tirunelveli 627 012, Page No: 133-136, doi.org/10.37896/whjj16.05/037

[9] Black Friday Sales Dataset Kaggle https://www.kaggle.com/kkartik93/blackfridaysales prediction?select=train.csv

[10] What is Regression and Classification in Machine Learning? https://www.geeksforgeeks.org/regressionclassificationsupervised machinelearning

[11] Calculating MSE https://www.dataquest.io/blog/understanding regressionerrormetrics/

[12] CrossValidation of a model https://towardsdatascience.com/whyand howtocrossvalidateamodeld6424b45261f

[13] Linear Regression https://www.geeksforgeeks.org/mllinearregression/

[14] Ridge Regression https://machinelearningmastery.com/ridge regressionwithpython/

[15] Decision Tree https://www.saedsayad.com/decision_tree_reg.htm

[16] Potturi, Keerthan, "Black Friday A study of consumer behavior and sales predictions" (2021). Creative Components. 784. https://lib.dr.iastate.edu/creativecomponents/784

[17] Jiawei Han, Micheline Kamber and Jian Pei, Data Mining Concepts and Techniques, Third edition, MK Publications, 2009.

[18] Ridge and Lasso Regression https://www.geeksforgeeks.org/typesof regressiontechniques/

[19] Ridge and Lasso https://machinelearningmastery.com/ridgeregression withpython/