Imputing Missing Data in Hydrology using Machine Learning Models

Missing data is a common problem confronted by many researchers in the field of hydrology. Rainfall and temperature time series are often found with gaps, and such missingness has large implications for hydrological modelling, flood frequency analysis, trend analysis and dam operation schemes. The presence of missing data hinders the analysis of the data and inhibits drawing correct inferences from it. In this study, missing data in rainfall and temperature series have been imputed using a kNN model and a tree-based model, and these imputed data have subsequently been used as predictors to predict river flow data using an Artificial Neural Network (ANN). Uncertainty in the kNN imputation model has been quantified with bootstrapping techniques, while the tree-based and ANN models were assessed by Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).


I. INTRODUCTION
Long-term hydro-meteorological variables can be utilized for understanding regional weather and climate, and for vulnerability assessment of water resources within the region or community of interest [1]. Such data can also be used for planning and managing water resources at the basin level using physically based models such as hydrological and hydraulic models. However, these variables are often confronted with missing data, which makes analysis difficult or sometimes impossible. Missingness is ubiquitously introduced owing to defects in the recording sensors, relocation of the sensors, and errors made while noting or observing the data. Due to missing rainfall and flow data, it becomes increasingly difficult to calibrate and validate a hydrological model for a basin. Furthermore, robust and complete data are of utmost importance for regional flood frequency analysis, hydraulic design and dam operation schemes. Rainfall and temperature are the key environmental variables used to understand the atmospheric, cryospheric and climatic processes within any region of interest. In the absence of complete observed data, many studies [2] [3] resort to remote sensing-based data, which are model-based products coupled with station data. However, such data carry higher uncertainty, and bias is introduced when downscaling to station scale. Ignoring the missing data of one variable usually means compromising the observed data of other variables, which results in a drastic loss of overall data. Such a case is usually confronted when using statistical models like the Multiple Linear Regression Model (MLRM), which assumes linearity among the observed variables: the corresponding observed value of one variable is dropped owing to the missingness of another, thereby resulting in a loss of statistical inference from the data.
International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, http://www.ijert.org

On the other hand, physically based models such as hydrological models can be used to impute missing river flow data, provided that the model is well calibrated and validated using historically observed data. However, finding such long-term observed data is not only difficult for an ungauged basin, but the data themselves tend to have gaps. It therefore becomes necessary to impute the missing data, preferably using contemporary Machine Learning models. A Machine Learning model like the Artificial Neural Network evolved from the working of the human brain for classification, identification and recognition. Primarily developed for medical and neurological study [4], it now finds practice in myriad disciplines. Although Machine Learning concepts were developed as early as the 1950s, their applicability has progressed only in recent decades owing to advancement in the computational power of computers. Researchers are now able to leverage such technologies to develop complex machine learning models with uses in many application areas. However, the performance of imputation depends strongly on the nature of the missing data, e.g. missing completely at random (MCAR) or missing at random (MAR) [5]. [10] used machine learning models for estimating missing flow data and achieved reliable accuracy. [11] compared Machine Learning and hydrological models as imputation models and found that Machine Learning performs better in imputing missing data. ANN can be used as an alternative to the environmental and physically based models used for Rainfall-Runoff modelling, groundwater modelling, water quality modelling and flow modelling, and it is indeed found more accurate [12]. Nevertheless, [13] also considered the Normal Ratio Method (NRM) and the Inverse Distance Weighted Method (IDM) for imputing missing rainfall data. Although such methods can reconstruct the data at lower frequency (i.e. annual time series), they have to resort to the percentage contribution of the nearest rainfall station to annual rainfall in order to convert to higher-frequency data (i.e. daily or hourly time series). [14] used a Recurrent Neural Network (RNN) for imputing missing data in a time series, and the model achieved reasonable accuracy, providing useful information about the data. Studies such as [15][16] have provided wider perspectives on using machine learning models for the analysis of water quality data. Further, [17] showcased the use of ANN and Support Vector Machine (SVM) models to predict nonlinear time series such as groundwater level, and found that SVM performed slightly better than ANN, though both models represented the nonlinearity of the data well. [18] studied the predictability of flow using kNN and ANN models for different scenarios and advocated that kNN offers better predictability of flow. [19] used regression trees and ANN to reconstruct missing rainfall data, which provided promising streamflow prediction using hydrological models such as the Soil and Water Assessment Tool (SWAT). [13] and [20] advocated the use of random-forest-based decision trees to reconstruct missing values in rainfall data, while [21] used sequential imputation based on the random forest technique. In this study, a multiple linear regression model, kNN imputation and decision-tree-based imputation were used to impute the missing data in rainfall and temperature.
MLRM was assessed based on the regression coefficients, the uncertainty of the kNN imputation was assessed using bootstrapping techniques, and the decision trees were assessed with Root Mean Square Error. These imputed data were then used as predictors to predict the flow at the two gauging stations located in the basin using an ANN with the backpropagation technique. The choice of predictors depends solely on the user's background knowledge of their relationship with the response variable.
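For concreteness, the two error metrics used throughout this study can be written out directly. The sketch below is in Python rather than the R environment the study used, and is illustrative only:

```python
import math

def rmse(obs, pred):
    """Root Mean Square Error between observed and predicted series."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean Absolute Error between observed and predicted series."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

obs = [2.0, 4.0, 6.0]
pred = [2.5, 3.5, 6.0]
print(rmse(obs, pred))  # RMSE penalizes large deviations more heavily
print(mae(obs, pred))   # MAE weights all deviations equally
```

RMSE squares the residuals before averaging, so it is more sensitive to occasional large imputation errors than MAE, which is why the two are often reported together.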
II. DATA AND METHODS

Flow, rainfall and temperature data were obtained from the National Centre of Hydrology and Meteorology (NCHM), Thimphu, Bhutan. Rainfall and temperature data from six meteorological stations in the Kholongchu basin, located in the eastern region of the Himalayan Kingdom of Bhutan, were used. There are also two flow gauging stations, as shown in Figure 1. The rainfall and temperature records contained a few missing values, which were imputed using the approaches stated below. Having imputed the missing data and validated the imputation efficiency, these data were used as predictors to predict the flow in the basin.

A. Imputation with Multiple Linear Regression Model (MLRM)
In this method, missing values at one station (response variable) were imputed by regressing on multiple other stations (independent variables) where data were complete. Month (a categorical variable) was also used as an independent variable for imputing the missing data. The R package by [22] was used to impute the missing values. Once the missing data in the station of interest were imputed, that station was subsequently treated as an independent variable to impute the missing data in the remaining stations. Mathematically, the MLRM for rainfall takes the form

R_s = b_0 + b_1 R_klung + b_2 R_tyang + b_3 M + e

where R_s is the rainfall at the station with missing values, R_klung and R_tyang are the rainfall at the Kanglung (klung) and Trashi Yangtse (tyang) stations, whose records were initially complete, M is the month, and e is the residual error. Similarly, MLRM was also performed for maximum and minimum temperature.

B. Imputation with k-Nearest Neighbours (kNN model)

kNN imputation uses a distance-weighted aggregation technique: it aggregates the values from the neighbours to obtain the replacement for a missing value. It does so using a weighted mean, where the weights are the inverted distances from each neighbour, so a closer neighbour has more influence on the imputed value. It is often good practice to sort the variables in increasing order of their number of missing values before performing kNN imputation. Here, kNN imputation was performed considering the 5 nearest neighbours (k = 5), using the R package developed by [23].
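The distance-weighted aggregation described above can be sketched as follows. This is an illustrative Python rendering of the idea, not the R package of [23]; the `knn_impute` function and its dict-per-record data layout are assumptions made for the example:

```python
import math

def knn_impute(rows, k=5):
    """Fill missing values (None) with a distance-weighted mean of the
    k nearest records. rows: list of dicts mapping variable -> value."""
    variables = rows[0].keys()
    out = [dict(r) for r in rows]
    for i, row in enumerate(rows):
        for var in variables:
            if row[var] is not None:
                continue
            # candidate donors: records where `var` is observed
            donors = []
            for j, other in enumerate(rows):
                if j == i or other[var] is None:
                    continue
                # distance over the variables both records observe
                shared = [v for v in variables
                          if v != var and row[v] is not None and other[v] is not None]
                if not shared:
                    continue
                d = math.sqrt(sum((row[v] - other[v]) ** 2 for v in shared))
                donors.append((d, other[var]))
            donors.sort(key=lambda t: t[0])
            nearest = donors[:k]
            # inverse-distance weights: closer neighbours get more influence
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            out[i][var] = (sum(w * val for w, (_, val) in zip(weights, nearest))
                           / sum(weights))
    return out
```

For instance, a record with a missing value sitting between two donors receives an imputed value pulled toward the closer donor, exactly the "closer neighbour has more impact" behaviour described above.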

Uncertainty of kNN imputation model with bootstrapping techniques
Whenever analysis or modelling is performed on imputed data, the uncertainty from imputation should be adequately accounted for. Running a model or performing an analysis on one-time imputed data ignores the fact that imputation estimates the missing values with uncertainty. The solution is multiple imputation, and one way to implement it is by bootstrapping. Bootstrapping is a technique in which data are sampled with replacement from the original data; it is used to draw inferences about a population using a sample, and it works with MCAR and MAR data. Here, multiple imputation with 1000 bootstrap replicates was performed, where each replicate yields the regression coefficients calculated as per eqn. 5. Subsequently, the standard error and bias associated with the replicated and original data were assessed.
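The bias and standard error of a bootstrapped statistic can be computed as sketched below. This minimal Python sketch (the study itself worked in R) bootstraps an arbitrary statistic; the `bootstrap` function name, the sample data and the seed are assumptions for illustration:

```python
import random
import statistics

def bootstrap(data, stat, n_boot=1000, seed=42):
    """Resample `data` with replacement n_boot times, evaluate `stat` on
    each replicate, and return (bias, standard error) of the replicates."""
    rng = random.Random(seed)
    original = stat(data)
    reps = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        reps.append(stat(sample))
    bias = statistics.mean(reps) - original   # replicate mean minus original estimate
    se = statistics.stdev(reps)               # spread of the replicates
    return bias, se

data = [3.1, 2.7, 4.0, 3.3, 2.9, 3.8, 3.5, 3.0]
bias, se = bootstrap(data, statistics.mean)
```

In the study the statistic being bootstrapped is the vector of regression coefficients of eqn. 5 rather than a simple mean, but the bias and standard-error definitions are the same.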

C. Tree based imputation with random forest
Tree-based imputation is a non-parametric approach in which no assumption is made about the relationships between variables. It can pick up complex nonlinear patterns and is often better than statistical models. Tree-based imputation uses random forests under the hood, building a separate random forest to predict the missing values of each variable, one at a time. This study uses the missForest imputation algorithm in the R environment developed by [24]: in the first iteration, missing data are initially imputed with the mean of the data; then, for each variable containing missing values, a random forest is fitted on the non-missing values and used to predict the missing values. The iterations continue until a stopping criterion is reached or the user-specified iteration number is met. The algorithm also gives the Out-of-Bag (OOB) error associated with the imputation, so there is no need to evaluate its efficiency separately. Here the error has been minimized by using 1000 decision trees. Increasing the number of trees might improve the imputation model, but it also requires more computation time; therefore, there is always a speed-accuracy trade-off to be made.
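The iterate-until-stable loop described above can be sketched in Python. Note the heavy simplification: the real missForest algorithm fits a random forest per variable at each pass, whereas here a 1-nearest-row lookup stands in as the predictor so the sketch stays dependency-free; `iterative_impute` and its matrix layout are assumptions for the example:

```python
import math

def iterative_impute(matrix, max_iter=10, tol=1e-6):
    """MissForest-style loop (sketch): initialise missing cells with the
    column mean, then repeatedly re-predict each missing cell from the
    other columns until the imputations stop changing."""
    n, m = len(matrix), len(matrix[0])
    missing = {(i, j) for i in range(n) for j in range(m) if matrix[i][j] is None}
    X = [row[:] for row in matrix]
    # first pass: fill every missing cell with its column mean
    for j in range(m):
        observed = [X[i][j] for i in range(n) if X[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(n):
            if X[i][j] is None:
                X[i][j] = col_mean
    for _ in range(max_iter):
        change = 0.0
        for i, j in missing:
            # fit/predict step: missForest fits a random forest here; a
            # 1-nearest-row lookup on the other columns stands in
            best = min((r for r in range(n) if r != i and (r, j) not in missing),
                       key=lambda r: sum((X[r][c] - X[i][c]) ** 2
                                         for c in range(m) if c != j))
            change += (X[best][j] - X[i][j]) ** 2
            X[i][j] = X[best][j]
        if math.sqrt(change) < tol:  # stopping criterion: imputations stabilised
            break
    return X
```

The structure (mean initialisation, per-variable refit, convergence test) mirrors the algorithm described above, even though the stand-in predictor is far weaker than a 1000-tree forest.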

D. Artificial Neural Network (ANN)
In this method, imputed data from the kNN model were used as input vectors for the neural network, with the flow data from the two flow gauging stations as the output vector. The neural network was developed using the neuralnet R package developed by [25] to predict the flow at the Uzorong and Muktirap stations with the different input vectors shown in Figure 2. A logistic activation function with the backpropagation option was used while running the ANN model. The stopping criterion for the model simulation was an error threshold of 0.01. To ensure adequate predictability of the ANN model, the data were first transformed using the min-max normalization technique, by which all the data were scaled to the range 0 to 1:

x' = (x - x_min) / (x_max - x_min)   (6)

Subsequently, the data were randomly split into training and testing sets, where each variable in the training data had 4458 observations and the testing data had 1486 observations. The output of a neuron is given by

y = f( Σ_{i=1}^{N} w_i x_i + b )   (7)

where y is the output vector, x_i is the input vector of the neural network, N is the number of neurons, w_i is the connection weight between input and output, f is the activation function, and b is the bias term.
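Eqns. 6 and 7 translate directly into code. The sketch below (illustrative Python, not the neuralnet R package) shows the min-max scaling of a series and the output of a single logistic neuron:

```python
import math

def min_max(x):
    """Eqn. 6: scale a series linearly onto the range [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def neuron_output(x, w, b):
    """Eqn. 7: y = f(sum_i w_i * x_i + b), with logistic activation f."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))  # logistic (sigmoid) activation
```

Because the logistic activation saturates outside roughly [-6, 6], feeding it min-max normalized inputs keeps the weighted sums in the responsive part of the curve, which is why the normalization step precedes training.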
Weights and biases are adjusted using the ANN's backpropagation algorithm, where the objective function (also known as the loss function) is the error between the network's output and the observed output. The error is minimized using the optimization algorithm known as gradient descent, which reduces the error value by taking steps from an initial guess until it reaches the best value. This makes gradient descent useful when it is not possible to solve analytically for the point where the derivative of the objective function equals zero. The step size is calculated from the learning rate and is expressed as follows:

step size = slope × learning rate

III. RESULTS AND DISCUSSION

The bias and standard error of the bootstrap replicates are shown in Table 2. Bias is the difference between the original regression coefficient and the replicates' regression coefficient, while the standard error indicates the standard deviation of the bootstrap replicates. From the results, it is observed that the bias and standard error are very small, and the imputation has been accepted for further analysis. The Root Mean Square Error associated with the decision-tree-based model is shown in Table 3, where Sherichu shows the maximum RMSE for all meteorological data. This is mainly because Sherichu had many missing values in the initial period compared with the other stations, while the RMSE for the remaining stations remains fairly below 3. Nevertheless, from the errors, it can be inferred that kNN imputation performed slightly better than the tree-based model. The decision-tree-based model could be further improved by increasing the number of decision trees, but with more trees the computation time also increases, and the user eventually has to make a speed-accuracy trade-off. Finally, imputation was carried out with MLRM (Fig 3), which clearly shows that the imputed data fit the variability of the overall data. Fig 4(a) shows the variability in the actual and predicted data, while Fig 4(b) shows the deviation of the predicted data from the actual data.
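Returning briefly to the step-size rule described in the Methods: the gradient-descent update is a short loop that repeatedly subtracts slope × learning rate from the current guess. The sketch below is an illustrative Python version, with a quadratic loss chosen only as an example (the study's actual loss is the ANN prediction error):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Minimise a function by stepping downhill from an initial guess.
    grad: function returning the slope at a point."""
    x = x0
    for _ in range(steps):
        step = grad(x) * learning_rate  # step size = slope x learning rate
        x = x - step                    # move against the slope
    return x

# example: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Near the minimum the slope shrinks, so the steps shrink automatically; the learning rate controls how aggressive those steps are and therefore trades convergence speed against stability.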
From the results, it is observed that the absolute mean deviation is 0.054% and 0.088% for Uzorong and Muktirap respectively, while the accuracy of the model was 94.45% and 91.11% respectively.

IV. CONCLUSION
Missing values in hydro-meteorological data are commonly found owing to defects in, or maintenance of, the sensors. Missingness is also introduced by relocation of a station and by errors in observation, which are usually treated as missing. Such missingness inhibits researchers in the fields of hydrology and climate from drawing inferences from the data, or sometimes leads to incorrect inferences. In this paper, missing values at six meteorological stations located in Eastern Bhutan have been imputed using machine learning models, namely kNN and decision tree models. The data have also been imputed with a statistical model using multiple linear regression. The uncertainty of the imputed data from the kNN model was assessed with a bootstrapping technique, and the results showed that the bootstrap replicates follow a normal distribution, indicating the usability of the imputed data.
The tree-based model was assessed using the model's in-built OOB error, which indicates a minimal error associated with the imputation. The data were finally imputed with the multiple linear regression model, and the imputed data were found to fairly represent the variability of the overall data. Using the kNN-imputed data as input vectors, an ANN model was developed to predict the flow at the Uzorong and Muktirap flow stations. The model was developed using the backpropagation algorithm to calculate the weights and the gradient descent optimisation algorithm to minimize the prediction error. The model was trained on the training data and subsequently tested on the testing data. On the testing data, the absolute mean deviation of the flow was 0.054% for Uzorong and 0.088% for Muktirap. Accordingly, the model accuracy was 94.53% and 91.11% for Uzorong and Muktirap respectively. The model accuracy can be further improved by using more training data that capture the variability of the overall data.