Rainfall Prediction using Linear approach & Neural Networks and Crop Recommendation based on Decision Tree

Deepali Patil

Shree L.R. Tiwari College of Engineering

Asst. Professor/Head of Department

Shakib Badarpura

Shree L.R. Tiwari College of Engineering

Abhishek Jain

Shree L.R. Tiwari College of Engineering

U. G Student

Aniket Gupta

Shree L.R. Tiwari College of Engineering

U. G Student

Abstract: Rainfall is one of the most vital components of agriculture, and predicting it is one of the most challenging tasks. In general, weather and rainfall are highly non-linear and complex phenomena, which require advanced computer modeling and simulation for their precise prediction. Numerous and diverse machine learning models are used to predict rainfall, including Multiple Linear Regression, Neural Networks, K-means, Naïve Bayes and more. These systems apply one of these models by extracting, training and testing datasets in order to predict the rainfall. This study shows the use of Multiple Linear Regression and Neural Networks to predict rainfall and of the Decision Tree algorithm to recommend crops. We infer that rainfall can be predicted and crops recommended with reasonable accuracy.

Keywords: Neural networks, Linear Regression, Decision Tree, Rainfall, Crop Recommendation, Machine Learning.


    The Rainfall Prediction model is implemented using two algorithms: Multiple Linear Regression and Neural Networks. Multiple Linear Regression is used to find the correlation between the diverse features in the dataset that contribute to rainfall, while the Neural Network is used to find the weights and biases that lead to an accurate prediction of rainfall. Initially, the dataset with multiple features is cleaned and pre-processed to make it suitable for feeding into a machine learning algorithm. A correlation matrix is created which shows the correlation between the different independent (predictor) variables and the dependent variable. Once the data is cleaned and processed, we build our Multiple Linear Regression and Neural Network models and fit them on our data. The main objective of our system is to predict the rainfall based on different features like humidity, temperature, pressure etc. with reasonable accuracy. Agriculture and rainfall are highly correlated with each other. As technology evolved, developments were made in different sectors including agriculture. Many IT giants started to provide weather information such as temperature, rainfall and humidity, which can be used by farmers. Therefore, we have also developed a Crop Recommendation System which recommends crops to the user based on the inputs provided. This in turn will help the farmer to have a better idea about irrigation and the types of crops to be grown.


    [2] A. Kala and Dr. S. Ganesh Vaidyanathan present an approach based on Artificial Neural Networks (ANN) in which a Feed Forward Neural Network (FFNN) model is built for predicting rainfall. ANNs are a valuable and attractive soft computing method for prediction. An ANN is based on a self-adaptive mechanism in which the model learns from historical data, captures functional relationships in the data and makes predictions on current data. Accurate prediction of rainfall is a major criterion for managing water resources. The prediction accuracy is measured using a confusion matrix and RMSE. The results show that the ANN-based prediction model achieves acceptable accuracy.

    [7] Swain, S., Patel, P., & Nandi, S. present a multiple linear regression model developed to estimate annual precipitation over Cuttack district, Odisha, India. The model forecasts precipitation for a year from the annual precipitation data of its three preceding years. The model was tested over a century-long dataset of annual precipitation, i.e. for 1904-2002. With the intercept (constant) of the multiple linear regression model assumed to be zero, the resulting equation gave a very good result. The model predictions showed an excellent association with the observed data: the coefficient of determination ($R^2$) and the adjusted $R^2$ were 0.974 and 0.963 respectively. This agreement justifies the application of the developed model over the study area to forecast rainfall, thereby aiding in proper planning and management.

    [4] Thirumalai, C., Harsha, K. S., Deepak, M. L., & Krishna, K. C. carried out a heuristic prediction of rainfall using machine learning techniques. Agriculture is a major part of our country and economy. A steady rain pattern generally plays an essential role in healthy agriculture, but too much or too little rainfall can be harmful and can even lead to the devastation of crops. The paper includes the rate of rainfall in past years for the various crop seasons and predicts the rainfall for further seasons. The paper also measures the various categories of data using the linear regression technique to give an efficient understanding of agriculture in India. It uses a real dataset of past years' rainfall rates for the various seasons. The results of their application help farmers to make a correct decision to harvest a particular crop according to the crop seasons.


    1.) Rainfall Prediction

    A. Dataset Used.

    The dataset for rainfall prediction, known as the Austin weather dataset, was collected from Kaggle. The dataset contains many features, including temperature, humidity, pressure, dew point, visibility etc. The data contained irregularities, which were removed in the data preprocessing step. The dataset for crop recommendation was taken from an open source GitHub repository. This dataset contains features like temperature, humidity, rainfall and pH value, together with the crops that are grown under particular values of these features.

    1. Data Cleaning and Pre-Processing

      Data cleansing is the process of detecting inaccurate or outlier records in a dataset and then replacing, modifying, or deleting the wrong data which could affect the accuracy of our model. In our case, the data has a few days where the required factors weren't recorded, and the rainfall in centimeters was marked as T if there was trace precipitation.

      Figure 01. Missing and Wrong Data

      Since the algorithm requires numbers, we can't work with letters appearing in the data. Hence, we need to clean the data before applying our model to it. We converted the unsuitable entries into values which can be used in our model and whose transformation doesn't affect the output.
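The cleaning step described above could be sketched as follows. This is only an illustration under stated assumptions: the column names, the use of "-" as the marker for an unrecorded value, and the choice to map the trace marker T to 0.0 are all hypothetical, since the paper does not state which replacement values were used.

```python
import pandas as pd

def clean_weather(df, trace_value=0.0):
    """Return a numeric copy of df with 'T' (trace) and unreadable entries handled."""
    cleaned = df.copy()
    for col in cleaned.columns:
        as_text = cleaned[col].astype(str).str.strip()
        # Assumption: trace precipitation ('T') is mapped to trace_value.
        as_text = as_text.where(as_text != "T", str(trace_value))
        # Anything that is not a number (e.g. the assumed '-' marker) becomes NaN.
        cleaned[col] = pd.to_numeric(as_text, errors="coerce")
    # Drop the days on which a required factor was not recorded.
    return cleaned.dropna().reset_index(drop=True)

# Made-up sample rows, not taken from the real dataset.
raw = pd.DataFrame({
    "TempAvg": ["60", "55", "-", "70"],
    "HumidityAvg": ["75", "68", "80", "-"],
    "Precipitation": ["0.46", "T", "0", "0.1"],
})
clean = clean_weather(raw)
```

Dropping incomplete days is one option; imputing them would be another, and which is appropriate depends on how many days are affected.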

    2. Finding Correlation and the Correlation Coefficient

      Correlation can be defined as a statistical measure that shows the extent to which two or more variables are dependent on each other. A positive correlation shows that if one variable increases the other also increases, while a negative correlation indicates that if one variable increases the other decreases. A correlation coefficient is a statistical measure which can be used to find out which features in the dataset are more strongly related to the output variable, and thus helps in feature selection.

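      $$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1}$$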

      A correlation matrix helps us to identify the features or independent variables which are highly correlated and to neglect those which are not correlated, thus helping us to decrease the complexity of our model.

      Figure 02. Correlation Coefficients of Data
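As a rough sketch of how a correlation matrix can drive feature selection, the snippet below keeps the features whose absolute Pearson correlation with the target reaches a threshold. The column names, the sample values and the threshold are assumptions made for illustration and are not taken from the paper.

```python
import pandas as pd

def select_features(df, target, threshold=0.5):
    """Return features whose |Pearson r| with the target is at least threshold."""
    corr = df.corr(method="pearson")          # full correlation matrix
    with_target = corr[target].drop(target)   # correlation of each feature with the target
    return with_target[with_target.abs() >= threshold].index.tolist()

# Made-up data: humidity rises with rainfall, temperature falls, noise is unrelated.
data = pd.DataFrame({
    "humidity":    [30, 45, 60, 75, 90],
    "temperature": [35, 31, 28, 24, 20],
    "noise":       [5, 1, 4, 1, 5],
    "rainfall":    [0.0, 0.5, 1.0, 1.5, 2.0],
})
selected = select_features(data, target="rainfall", threshold=0.5)
```

Taking the absolute value keeps strongly negative correlations, which are as informative for prediction as strongly positive ones.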

    3. Normalization (Scaling of Data)

      Scaling or normalization is a method used to normalize the range of the independent variables or features of the data. It is usually performed during the data preprocessing step. Normalizing the data makes the model less complex to train, as all values are brought into a comparable range. In our case the normalized values lie approximately in a range of -1 to 1. The following formula is used for normalization.

      $$x' = \frac{x - \mu}{\sigma} \tag{2}$$

      where $\mu$ is the mean and $\sigma$ is the standard deviation (S.D) of the feature.

      We normalized or scaled the data using the formula mentioned above. We used the built-in R function scale to normalize our training data. We did not apply the same scale call to the test data, because it would use the mean and standard deviation of the whole data. Since the model must not be given any prior knowledge of other data, be it the mean or the S.D, we normalized the test data by separately calculating its mean and S.D and then using the above formula to find the equivalent normalized values.

      Note: We normalized the data only for Neural Networks

      Figure 03. Normalized or Scaled Data
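A minimal sketch of this scaling in Python is given below, assuming the standardization formula (value minus mean, divided by standard deviation) that R's scale function applies by default. The function name and sample values are hypothetical. The sample standard deviation (ddof=1) is used to match R's behavior.

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Scale each column to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0, ddof=1) if std is None else std
    return (x - mean) / std, mean, std

# Made-up temperature and humidity readings.
train = [[20.0, 50.0], [25.0, 60.0], [30.0, 70.0]]
test = [[22.0, 40.0], [28.0, 80.0]]

train_scaled, train_mean, train_std = standardize(train)
# As described in the text, the test data is scaled with its own mean and S.D.
test_scaled, _, _ = standardize(test)
```

The optional mean and std arguments also allow the training statistics to be reused for the test data, which is a common alternative to the separate scaling described in the text.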

    4. Building a Model (Multiple Linear Regression and Neural Networks)

    1. Multiple Linear Regression

      In statistics, linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. When there is only one independent variable, it is called simple linear regression. When there is more than one independent variable, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted rather than a single variable. Linear regression has many applications, such as prediction, forecasting and error reduction. A predictive model can be fitted to the collected dataset of observed values using linear regression. Once the model is fitted, it can be used to make predictions. The multiple linear regression equation is as follows:

      $$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p \tag{3}$$

      where $\hat{y}$ is the predicted or expected value of the dependent variable, which is rainfall in our case, and $X_1, X_2, \ldots, X_p$ are the independent or predictor variables such as temperature and humidity. When all of the independent variables $X_1, \ldots, X_p$ are equal to zero, $\hat{y}$ equals the intercept $b_0$. The estimated regression coefficients are $b_1, b_2, \ldots, b_p$. Each regression coefficient denotes the change in rainfall ($\hat{y}$) for a one-unit change in the corresponding independent variable. In the multiple regression situation, $b_1$ is the change in $\hat{y}$ for a unit change in $X_1$, keeping all other independent variables constant.

      Figure 04. Multiple Linear Regression

      In our case the dependent variable is rainfall (precipitation) and the independent variables are humidity, temperature, pressure and others. The days are represented on the X-axis and the scale of the features or independent variables is shown on the Y-axis. In the above graph we can observe that the rainfall can be high when the temperature is high.
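To make the regression equation concrete, the sketch below fits the coefficients by ordinary least squares with NumPy and then predicts rainfall for one input row. It is an illustration only: the data is synthetic, generated from known coefficients so the fit can be checked, and the helper names fit_mlr and predict_mlr are hypothetical rather than the authors' code.

```python
import numpy as np

def fit_mlr(X, y):
    """Estimate [b0, b1, ..., bp] by ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 (the intercept) is estimated too.
    design = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

def predict_mlr(coeffs, X):
    """Apply y_hat = b0 + b1*X1 + ... + bp*Xp to each row of X."""
    return coeffs[0] + np.asarray(X, dtype=float) @ coeffs[1:]

# Synthetic features: temperature and humidity (made-up values).
X = np.array([[30.0, 80.0], [25.0, 60.0], [35.0, 90.0], [20.0, 50.0], [28.0, 70.0]])
# Synthetic target built from b0 = 0.5, b1 = -0.01, b2 = 0.02.
y = 0.5 - 0.01 * X[:, 0] + 0.02 * X[:, 1]

coeffs = fit_mlr(X, y)
prediction = predict_mlr(coeffs, [[30.0, 80.0]])
```

Because the synthetic target is noise free, the fit recovers the generating coefficients; with real weather data the residuals would not be zero.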

    2. Neural Networks

    Artificial Neural Networks are among the most popular machine learning and deep learning algorithms. They are inspired by biological neurons and are capable of making human-like decisions with the help of computations. For example, in our case we trained the neural network on different features like humidity, temperature, pressure etc., and it learns to estimate the rainfall from these features using the results in the training dataset. A very simple neural network might contain only one input neuron, one hidden neuron, and one output neuron. It takes several independent variables (the input parameters), multiplies them by their coefficients (the weights), and runs the result through an activation function such as ReLU or a unit step function.

    $$o_j = f\left(\sum_{i} w_{i,j}\, a_i + b_j\right) \tag{4}$$

    In our case the input layer will contain the number of neurons equal to the input features.

    Figure 05. Neural Networks

    The inputs are multiplied by the weights and then forwarded to the hidden layer for further computation. An activation function is applied, which is discussed below. The output layer has only one neuron, as we have to predict only one variable, which is rainfall.

    Activation Function (Rectified Linear Unit ReLu)

    ReLU stands for Rectified Linear Unit and is widely used in deep learning. The ReLU activation function is half rectified: f(z) is 0 when z is less than zero, and f(z) is equal to z when z is positive or 0. In other words, it converts all negative values to 0.

    The activation function can be denoted as:

    $$R(z) = \max(0, z) \tag{5}$$

    Range: [0, ∞)

    Figure 06. Graph of Activation Function (ReLU)
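The behavior of ReLU can be checked with a few sample values. The short sketch below uses NumPy; the input values are arbitrary.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z): negatives become 0, other values pass through."""
    return np.maximum(0.0, z)

values = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
```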

    When a neural network is trained on the training set, it is initialized with a set of weights. Optimization of weights is done during the training period and ideal weights are produced. A weighted sum of inputs is produced by the neuron as mentioned below.

    $$Y = \sum_{i=1}^{n} w_i x_i + \text{bias} \tag{6}$$

    where the inputs are $x_1, x_2, \ldots, x_n$ and the weights are $w_1, w_2, \ldots, w_n$.

    Finally, the computed value is passed to the activation function, which is ReLU in our case, to produce the output:

    $$\text{output} = f(x_1 w_1 + x_2 w_2 + \ldots + x_n w_n + \text{bias}) \tag{7}$$
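The weighted sum followed by the activation can be traced by hand on a tiny network. The sketch below pushes three input features through one hidden layer of two neurons and a single output neuron, applying ReLU at each layer. The layer sizes, weights and biases are made-up numbers chosen so the arithmetic is easy to follow; they are not the trained values from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer_forward(a, W, b):
    """For each neuron j: relu(sum_i W[i, j] * a[i] + b[j])."""
    return relu(np.asarray(a, dtype=float) @ W + b)

features = np.array([1.0, 2.0, 3.0])           # e.g. scaled humidity, temperature, pressure
W_hidden = np.array([[0.1, -0.2],
                     [0.0,  0.3],
                     [0.2, -0.5]])             # 3 inputs -> 2 hidden neurons
b_hidden = np.array([0.0, 0.1])
W_out = np.array([[2.0], [1.0]])               # 2 hidden neurons -> 1 output
b_out = np.array([0.5])

hidden = layer_forward(features, W_hidden, b_hidden)   # weighted sums 0.7 and -1.0
rainfall = layer_forward(hidden, W_out, b_out)         # 2.0 * 0.7 + 1.0 * 0.0 + 0.5
```

The second hidden neuron receives a negative weighted sum, so ReLU sets it to zero and it contributes nothing to the output.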

    2.) Crop Recommendation

    A. Decision Tree Regression

    Decision Tree is a machine learning algorithm that uses a flowchart-like tree structure: a model of decisions and all of their possible results, including outcomes, input costs and utility. It is a supervised learning algorithm. Decision trees can be used for both categorical and continuous output variables.

    Decision tree regression observes the features of an object and trains a model in a tree-like structure to predict future data with a meaningful continuous output. Continuous output means that the output is not restricted to a known, discrete set of values.

    Discrete output example: A decision tree model which predicts whether there will be rain tomorrow or not.

    Continuous output example: A decision tree regressor model which predicts the profit of a company from the sales of a particular kind of product.

    The core algorithm for building decision trees is ID3, which uses a top-down, greedy search through the branches of the decision tree with no backtracking. With the help of information gain (I.G) and, for numerical targets, standard deviation (S.D) reduction, the ID3 algorithm can be used to implement a decision tree regressor.

    Information Gain

    Information gain, sometimes also referred to as Kullback-Leibler divergence, is represented by IG (S, A) for a set S. It can be defined as the effective change in entropy after deciding on a particular attribute A. In other words, I.G measures the relative change in entropy with respect to the independent variables.

    $$IG(S, A) = H(S) - H(S, A) \tag{8}$$
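The information gain formula can be illustrated with a small calculation, where H(S) is the entropy of the labels and H(S, A) is the weighted entropy of the subsets obtained by splitting on attribute A. The crop labels and attribute values below are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = H(S) - H(S, A) for one attribute A."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # H(S, A): entropy of each subset, weighted by the subset's share of S.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

crops = ["rice", "rice", "wheat", "wheat"]
gain_rainfall = information_gain(crops, ["high", "high", "low", "low"])   # separates the crops
gain_soil = information_gain(crops, ["clay", "sand", "clay", "sand"])     # tells us nothing
```

The rainfall attribute splits the crops perfectly and so removes all of the entropy, while the soil attribute leaves each subset as mixed as the original set.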

    Standard Deviation

    In a decision tree algorithm, the data is partitioned into smaller subsets which contain instances with similar (homogeneous) values. The tree is built starting from the root node, and the data is partitioned from there. To measure the homogeneity of a sample we use the standard deviation. A completely homogeneous numerical sample has S.D = 0.

    $$S(T, X) = \sum_{c \in X} P(c)\, S(c) \tag{9}$$

    Standard Deviation Reduction

    The decrease in standard deviation (S.D) after a dataset is split on an attribute is called the standard deviation reduction. The main aim in developing a decision tree model is to find the attribute which returns the highest reduction in standard deviation (S.D).

    Step 1: The S.D of the target is calculated.

    Step 2: The dataset is then divided on basis of the different attributes. The S.D for each branch is calculated. Finally, we subtract the resulting S.D from the S.D before the split. The result is the S.D reduction.

    $$SDR(T, X) = S(T) - S(T, X) \tag{10}$$

    Step 3: The attribute having the largest standard deviation (S.D) reduction is chosen for the decision node.

    Step 4: The dataset is partitioned based on the values of the selected attribute. This process is run recursively on the non-leaf branches until all data is processed.
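Steps 1 to 3 can be sketched in a few lines of NumPy. The example computes the standard deviation reduction for two candidate attributes and picks the one with the largest reduction for the decision node. The target values, attribute names and helper names are hypothetical and serve only to show the calculation.

```python
import numpy as np

def weighted_std(target, attribute):
    """S(T, X): sum over attribute values c of P(c) * S(c)."""
    target = np.asarray(target, dtype=float)
    attribute = np.asarray(attribute)
    total = len(target)
    return sum(
        (attribute == c).sum() / total * target[attribute == c].std()
        for c in np.unique(attribute)
    )

def sdr(target, attribute):
    """SDR(T, X) = S(T) - S(T, X)."""
    return float(np.std(target) - weighted_std(target, attribute))

def best_attribute(target, attributes):
    """Step 3: choose the attribute with the largest S.D reduction."""
    return max(attributes, key=lambda name: sdr(target, attributes[name]))

target = [10.0, 12.0, 30.0, 32.0]                 # made-up numeric target
attributes = {
    "season": ["dry", "dry", "wet", "wet"],       # groups similar target values
    "soil": ["clay", "sand", "clay", "sand"],     # mixes low and high target values
}
chosen = best_attribute(target, attributes)
```

Splitting on season leaves two tight groups (S.D of 1 each), whereas splitting on soil leaves the spread almost unchanged, so season gives the larger reduction.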


    1. Python

      We used Python as our programming language to implement the machine learning algorithms. The main reasons for choosing Python are mentioned below:

      • Numerous machine learning libraries

      • Data visualization libraries and functions

      • Software development

      • Easy to use for mathematics and data analytics

      • Python can work on different platforms like Windows, Mac, Linux, Raspberry Pi, etc.

      • It has a simple syntax.

      • With Python and its libraries, we can implement a program in fewer lines as compared to other languages.

      • Python is an interpreted language.

      Figure 07. Python IDE (Spyder).

    1. NumPy and Pandas

      For scientific computations and to work with high-performance arrays and matrices we used NumPy. It is an open source library and is very useful for processing multidimensional arrays. Additionally, we used the Pandas package for data manipulation.

    2. Sklearn

      Scikit-learn, or sklearn, is a free and open source machine learning library for Python. With the help of the sklearn library we can use diverse classification, regression and clustering algorithms. In our case we used one of its many algorithms, the Decision Tree Regressor.

    3. R and RStudio

      R is another popular programming language in the field of machine learning. We can implement a variety of linear and non-linear operations, classification etc., and also neural networks, using its built-in libraries and packages. We used R to develop a neural network model to predict the rainfall. Libraries like Keras and TensorFlow are available in R and hence were used accordingly.

      We used RStudio, which is an IDE for the R programming language. RStudio can be used in two forms: as a desktop application or via a web browser. We used the desktop version of RStudio to build our model.

    1. TensorFlow

      We took help from one of the most used and popular machine learning libraries, which is TensorFlow.

      TensorFlow architecture works in three parts:

      • Preprocessing the data

      • Build the model

      • Train and estimate the model

        The TensorFlow library has numerous machine learning and neural network computation functions which help to make complex tasks easy. Some reasons to use TensorFlow are as follows:

      • In neural machine translation, it can help to reduce errors by up to 50%.

      • It provides flexibility and multi-level abstraction.

      • It helps to train the model faster and also to run more experiments.

        The TensorFlow library provides numerous APIs to build deep learning architectures like CNN or RNN.

    2. Keras

    We used Keras for building our neural network model. It is a high-level neural network API which can be used with other libraries such as TensorFlow or Theano. The main advantages of using Keras for building neural networks are as follows:

      • We can use pre-trained models such as VGG16, ResNet, Inception networks etc.

      • It reduces cognitive load.

      • It has simple and consistent APIs.

      • It reduces the number of human actions required for common computations.


    We fed multiple inputs such as temperature, humidity and wind speed into our two machine learning models, viz. Multiple Linear Regression and Neural Networks, and computed the output as shown below.

    We fed our manual input to the model in the form of an array to get the output from the model.

    The output for Prediction of Rainfall using Multiple Linear Regression is as below.

    Figure 08. Output of MLR

    The output from our Neural Network model shows a mean absolute error of 0.1459216. Here's a snapshot of it.

    Figure 09. Output of Neural Network
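For reference, the mean absolute error is the average of the absolute differences between the observed and the predicted values. The sketch below computes it with NumPy on made-up numbers; it does not reproduce the 0.1459216 reported above.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up observed and predicted rainfall values.
mae = mean_absolute_error([0.0, 0.5, 1.2, 0.1], [0.1, 0.4, 1.0, 0.3])
```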

    The Accuracy of Decision Tree came out to be 90-92%.

    The output for Crop Recommendation using Decision Tree Regressor is as shown below:

    Figure 10. Output of Decision tree
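A minimal sketch of how a Decision Tree Regressor from sklearn might be used for crop recommendation is shown below. The paper does not describe how crops were encoded for the regressor, so the integer encoding, the crop names and the feature values here are assumptions made purely for illustration.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical label encoding: the index in this list is the numeric target.
CROPS = ["rice", "maize", "chickpea", "cotton"]

# Columns: temperature, humidity, rainfall, pH (made-up values).
X = [
    [24.0, 82.0, 220.0, 6.5],
    [22.0, 65.0, 90.0, 6.2],
    [18.0, 17.0, 80.0, 7.3],
    [25.0, 80.0, 75.0, 6.9],
]
y = [0, 1, 2, 3]

model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

def recommend(sample):
    """Map the regressor's numeric output back to a crop name."""
    code = int(round(float(model.predict([sample])[0])))
    return CROPS[code]

recommended = [recommend(row) for row in X]
```

With categorical targets such as crop names, sklearn's DecisionTreeClassifier would be the more conventional choice; the regressor is used here only because it is the algorithm named in the text.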


Hence, using machine learning techniques like Multiple Linear Regression and Neural Networks we can predict the rainfall with considerable accuracy. However, the accuracy of the model depends on the type of data that has been fed to it. Modifications need to be made when data of a different structure and type is used to predict the rainfall. Additionally, with the help of multiple input features like rainfall, humidity and temperature, we also recommended crops that can be grown, using another popular machine learning technique known as Decision Tree. The accuracy of the decision tree was quite satisfactory and could help many farmers to make better decisions.


  1. Deepti Gupta, Udayan Ghose, A Comparative Study of Classification Algorithms for Forecasting Rainfall, 4th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions).

  2. A. Kala, Dr. S. Ganesh Vaidyanathan, Prediction of Rainfall using Artificial Neural Network, Proceedings of the International Conference on Inventive Research in Computing Applications (ICIRCA 2018).

  3. Inyaem, U. (2018). Construction Model Using Machine Learning Techniques for the Prediction of Rice Produce for Farmers. 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). doi:10.1109/icivc.2018.8492883

  4. Thirumalai, C., Harsha, K. S., Deepak, M. L., & Krishna, K. C. (2017). Heuristic prediction of rainfall using machine learning techniques. 2017 International Conference on Trends in Electronics and Informatics (ICEI).

  5. Abishek, B., Priyatharshini, R., Eswar, M. A., & Deepika, P. (2017). Prediction of effective rainfall and crop water needs using data mining techniques. 2017 IEEE Technological Innovations in ICT for Agriculture and Rural Development (TIAR).

  6. Jambekar, S., Nema, S., & Saquib, Z. (2018). Prediction of Crop Production in India Using Data Mining Techniques. Fourth International Conference on Computing Communication Control and Automation (ICCUBEA).

  7. Swain, S., Patel, P., & Nandi, S. (2017). A multiple linear regression model for precipitation forecasting over Cuttack district, Odisha, India. 2017 2nd International Conference for Convergence in Technology (I2CT). doi:10.1109/i2ct.2017.8226150

  8. Kulkarni, N. H., Srinivasan, G. N., Sagar, B. M., & Cauvery, N. K. (2018). Improving Crop Productivity Through A Crop Recommendation System Using Ensembling Technique. 2018 3rd International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS).

  9. Vasantha, B., Tamilkodi, R., & Venkateswara Kiran, L. (2019). Rainfall pattern prediction using real time global climate parameters through machine learning. 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN).

  10. Nigam, A., Garg, S., Agrawal, A., & Agrawal, P. (2019). Crop Yield Prediction Using Machine Learning Algorithms. 2019 Fifth International Conference on Image Information Processing (ICIIP). doi:10.1109/iciip47207.2019.8985951

  11. Ahuna, M. N., Afullo, T. J., & Alonge, A. A. (2019). Rain attenuation prediction using artificial neural network for dynamic rain fade mitigation.

  12. Kishor, R. V., Shatrughan, K. P., Balasaheb, M. K., Sadashiv, M. B., Sachin, V., Gaike, V. V., & Seetamraju, M. (2018). Agromet Expert System for Cotton and Soyabean Crops in Regional Area. 2018 International Conference On Advances in Communication and Computing Technology (ICACCT).

  13. https://www.kaggle.com/grubenm/austin-weather (Dataset Download)
