Use of Multivariate Statistical Analysis for Detecting Spatial and Seasonal Attributes of Surface Water Quality

The existence of both point and non-point inputs of pollutants raises the cost of water body treatment due to their negative effect on watersheds. Additionally, categorizing the most significant surface water quality parameters (SWQPs) in both spatial and temporal domains is crucial. Thus, to classify the dominant SWQPs and accordingly identify both spatial and temporal aspects of surface water quality, multivariate statistical analysis (MSA), such as principal component analysis/factor analysis, cluster analysis, and discriminant analysis, was used. The obtained results demonstrated that turbidity, total suspended/dissolved solids, chemical oxygen demand, and biochemical oxygen demand are the dominant SWQPs, which contribute to spatial and temporal surface water quality status of the Saint John River, Canada. Moreover, a decrease in the dimensionality of surface water quality data was achieved. To conclude, the use of MSA can lead to effective savings and applicable exploitation of water resources.


INTRODUCTION
Point sources of pollutants act as fixed polluting sources, whereas non-point sources reflect seasonal conditions (Singh et al. 2004). Both kinds of source can degrade the surface water quality of watersheds by raising the concentrations of surface water quality parameters (SWQPs) (Carpenter et al. 1998; Qadir et al. 2007). Treatment should therefore be directed at the most significant SWQPs, which contribute to spatial and seasonal changes in surface water quality (Elhatip et al. 2007). Doing so can yield valuable savings and sound utilization of water resources (Natural resources 2016).
Thus, multivariate statistical analysis (MSA), such as principal component analysis/factor analysis (PCA/FA), cluster analysis (CA), and discriminant analysis (DA), is used to better understand the current water quality status of a specified watershed (Vega et al. 1998). The use of MSA is essential because it can clarify the relationships between various SWQPs, such as total dissolved solids (TDS), total solids (TS), total suspended solids (TSS), turbidity, biochemical oxygen demand (BOD), dissolved oxygen (DO), chemical oxygen demand (COD), electrical conductivity (EC), temperature, and power of hydrogen (pH). Furthermore, it is very difficult to draw clear conclusions from raw surface water quality data. Hence, MSA can be employed to detect surface water quality changes and to identify the major SWQPs of water bodies (Reghunath et al. 2002).
Locating the association between water sampling stations, decreasing the complexity of large-scale datasets into clusters with similar characteristics, and recognizing the dominant SWQPs are the main advantages of MSA. Conversely, the occurrence of the same SWQPs in different principal components (PCs) and difficulty in realizing the appropriate number of classified groups are the main drawbacks of MSA (Singh et al. 2004).
The key objectives of our research study are to:

Study site
The Saint John River (SJR) is one of the oldest streams in Canada. It covers an area of 4748 km². The Oromocto, Nashwaak, Keswick, Miramichi, Tobique, Aroostook, and Madawaska feed the SJR. The average width of the SJR is 750 m and the average depth is 3 m (Arseneault 2008). The study area comprises a 130 km long reach, which covers both the lower and middle basins of the SJR (Fig. 1). Optical SWQPs, namely turbidity, TSS, TS, and TDS, were analyzed. Additionally, non-optical SWQPs, namely COD, BOD, DO, pH, EC, and temperature, were analyzed.

Multivariate statistical analysis (MSA)
SWQPs were subjected to MSA to identify the parameters most responsible for both spatial and seasonal changes in the SJR. The working strategy of each MSA technique (PCA/FA, CA, and DA) is described below.

Principal component analysis/factor analysis (PCA/FA)
PCA linearly transforms the raw dataset into a new set of uncorrelated variables, called principal components (PCs). PCA offers information about the major variables within the utilized dataset (Shrestha & Kazama 2007). PCs can be calculated according to the following equation:
$$z_{ij} = a_{i1}x_{1j} + a_{i2}x_{2j} + \cdots + a_{im}x_{mj}$$
where $z$ is the score of each PC; $a$ is the loading of each PC; $x$ is the measured value of each parameter; $i$ is the number of each PC; $j$ is the number of the sample; and $m$ is the total number of parameters.
Following PCA, FA was employed to retain variables with major significance and minimize the influence of variables with negligible significance (Vega et al. 1998; Simeonov et al. 2003). FA can be stated as follows:
$$z_{ji} = a_{f1}f_{1i} + a_{f2}f_{2i} + \cdots + a_{fm}f_{mi} + e_{fi}$$
where $z$ is the measured parameter; $a$ is the loading value of each parameter; $f$ is the score of each factor; $e$ is the error term; $i$ is the number of the sample; and $m$ is the total number of factors.
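Both relationships above are linear combinations, which can be illustrated with a minimal sketch. The function names and all numeric values below are hypothetical and chosen only for illustration; they are not taken from the SJR dataset, and the sketch assumes the parameter values have already been standardized.

```python
def pc_score(loadings, sample):
    """Score of one PC for one sample: the sum of loading * measured value
    over all parameters."""
    if len(loadings) != len(sample):
        raise ValueError("one loading is needed per parameter")
    return sum(a * x for a, x in zip(loadings, sample))


def fa_residual(measured, factor_loadings, factor_scores):
    """Error term of the factor model: the measured value minus the sum of
    loading * factor score over all retained factors."""
    if len(factor_loadings) != len(factor_scores):
        raise ValueError("one loading is needed per factor")
    modelled = sum(a * f for a, f in zip(factor_loadings, factor_scores))
    return measured - modelled


# Hypothetical standardized values of three parameters for one sample.
sample = [1.0, -0.5, 2.0]
# Hypothetical loadings of those three parameters on a single PC.
loadings = [0.5, 0.25, 0.75]
score = pc_score(loadings, sample)

# Hypothetical two-factor model for one standardized parameter.
residual = fa_residual(1.25, [0.5, 0.25], [1.0, 2.0])
```

In practice the loadings would come from an eigen-decomposition of the correlation matrix of the ten SWQPs rather than being typed in by hand.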

Cluster analysis (CA)
CA classifies entities into discrete clusters (McKenna 2003). Hierarchical agglomerative CA was employed on the utilized surface water quality dataset. A dendrogram is the main visualized result of hierarchical agglomerative CA, which can provide a summary of the obtained clusters with a remarkable decline in dimensionality of the raw dataset (Shrestha & Kazama 2007).

Discriminant analysis (DA)
DA establishes relationships between pre-defined clusters according to discriminating variables (Singh et al. 2004). The number of discriminant functions obtained is either the number of clusters minus one or the number of parameters, whichever is smaller.
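The rule for the number of discriminant functions can be written as a one-line calculation. The sketch below is illustrative only; the function name is hypothetical, and the inputs of four clusters and ten parameters are the figures used in this study.

```python
def n_discriminant_functions(n_clusters, n_parameters):
    """Number of discriminant functions: the smaller of (clusters - 1)
    and the number of discriminating parameters."""
    if n_clusters < 2 or n_parameters < 1:
        raise ValueError("need at least two clusters and one parameter")
    return min(n_clusters - 1, n_parameters)


# Four clusters described by ten SWQPs give three discriminant functions.
n_functions = n_discriminant_functions(4, 10)
```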

RESULTS AND DISCUSSION

Analysis of SWQPs
As shown in Table 1, ten SWQPs, namely turbidity, TSS, TS, TDS, COD, BOD, DO, pH, EC, and temperature, were obtained from 66 sampling points according to the APHA standard methods. Turbidity ranged from 1.19 to 13.10 NTU with an average of 4.84 NTU. TSS levels ranged from 0.60 to 11.40 mg/l with a mean value of 3.59 mg/l. TS ranged from 58.00 to 245.00 mg/l, and TDS varied from 52.40 to 233.85 mg/l. COD ranged from 4.80 to 86.64 mg/l with an average of 27.55 mg/l, while BOD ranged from 1.21 to 3.25 mg/l with an average of 1.75 mg/l. Finally, DO, pH, EC, and temperature ranged from 6.71 to 14.14 mg/l, 6.51 to 8.42, 29.50 to 148.90 µS/cm, and 5.00 to 23.30 °C, respectively.
In spring, turbidity and TSS levels were higher than in summer because rainfall and snowmelt cause soil erosion. Soil erosion could carry sediments from forested land into the SJR basins (Sharaf El Din et al.

PCA/FA
PCA/FA was employed to identify the most significant SWQPs in the SJR. It was applied to 66 sampling points using ten SWQPs (i.e., turbidity, TSS, TS, TDS, COD, BOD, DO, EC, pH, and temperature) to determine the dominant parameters contributing to water quality in the selected study site. PCA generated a set of PCs along with their corresponding eigenvalues, which measure the significance of the extracted PCs. Eigenvalues ≥ 1 are considered significant (Shrestha & Kazama 2007). PC1, PC2, and PC3 have eigenvalues > 1; hence, they are considered the dominant PCs. As shown in Table 2, PC1, PC2, and PC3 captured 88.126% of the total variation in the SJR data. These three PCs explained 49%, 20%, and 19% of the total variance, respectively. Each SWQP with a loading value of 0.75 or higher was considered a significant parameter, which may contribute to water quality changes in the SJR. On the other hand, SWQPs with loading values less than 0.40 were considered insignificant. In Fig. 3, turbidity, TSS, TS, and TDS show strong positive loadings.
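The retention rule (eigenvalue ≥ 1) and the share of variance explained by each PC can be sketched as follows. This is a minimal illustration, assuming the eigenvalues come from a correlation matrix so that they sum to the number of parameters. The function names and the eigenvalues are hypothetical and are not the values reported in Table 2.

```python
def explained_variance(eigenvalues):
    """Percentage of the total variance carried by each PC."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]


def retained_components(eigenvalues, threshold=1.0):
    """1-based numbers of the PCs whose eigenvalue meets the threshold."""
    return [i + 1 for i, ev in enumerate(eigenvalues) if ev >= threshold]


# Hypothetical eigenvalues for a four-parameter correlation matrix (sum = 4).
eigenvalues = [2.0, 1.0, 0.75, 0.25]

percent = explained_variance(eigenvalues)
retained = retained_components(eigenvalues)
# Cumulative variance captured by the retained PCs only.
cumulative = sum(percent[i - 1] for i in retained)
```

With these illustrative inputs the first two PCs are retained and together account for 75% of the variance, which mirrors how the three dominant PCs of the SJR data were selected and summed.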
In PC1, turbidity, TSS, TS, and TDS are the dominant SWQPs, which contribute to both spatial and seasonal surface water quality changes in the river. Soil erosion, driven by natural and human processes such as snowmelt, rainfall, forestry, and agricultural activities, is the main cause of elevated turbidity and TSS.
PC2 showed a strong positive loading for EC, whereas pH loaded moderately. Hence, EC is the major SWQP in PC2 responsible for spatio-temporal surface water quality changes in the SJR, which can be attributed to irrigation activities. PC3 showed strong positive loadings for COD and BOD, attributable to industrial sewage, which may result from food processing and paper production industries along the shoreline of the river.
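The loading thresholds used in this section can be expressed as a small classifier. The sketch below rests on two assumptions that the text does not state explicitly: loadings between 0.40 and 0.75 are labelled "moderate", and the sign is ignored by taking the absolute value. The parameter names A, B, and C and their loadings are hypothetical.

```python
STRONG_THRESHOLD = 0.75         # loadings at or above this are significant
INSIGNIFICANT_THRESHOLD = 0.40  # loadings below this are insignificant


def classify_loading(loading):
    """Label a loading by its magnitude (absolute value is an assumption)."""
    magnitude = abs(loading)
    if magnitude >= STRONG_THRESHOLD:
        return "strong"
    if magnitude < INSIGNIFICANT_THRESHOLD:
        return "insignificant"
    # Assumed label for the band between the two stated thresholds.
    return "moderate"


# Hypothetical loadings of three parameters on one PC.
example_loadings = {"A": 0.9, "B": 0.6, "C": -0.2}
labels = {name: classify_loading(value)
          for name, value in example_loadings.items()}
```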
The above outcomes indicate that PCA/FA is a cost-effective method in surface water quality studies owing to its ability to identify the most significant pollutants in the SJR.

CA
Hierarchical agglomerative CA was utilized to identify clusters of samples with similar surface water quality properties. As shown in Fig. 4, CA created multiple levels of clusters and produced a dendrogram that classified the collected water sampling points (i.e., 66 samples) into four discrete clusters. Cluster 1 contains 28 water samples, which have higher COD and BOD values because of the food and paper production industries. Cluster 2 includes 15 water samples, which represent the lower basin of the river, where there is less industrial and agricultural sewage. Cluster 3 contains 10 water sampling points, which were collected in spring; these samples therefore receive pollution mostly from natural sources, such as rainfall and snowmelt, in the study site of the SJR. Similar to cluster 3, cluster 4 includes 13 water samples, which were also collected in spring. Natural sources were responsible for increasing soil erosion, which consequently raised turbidity and TSS concentrations in the SJR.
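The agglomerative procedure behind such a dendrogram can be sketched in a few lines. This is an illustration under stated assumptions, not the configuration used in this study: the linkage rule (average linkage) and the distance measure (Euclidean) are assumptions, since neither is specified here, and the five two-parameter samples are hypothetical. The sketch cuts the hierarchy at a requested number of clusters instead of drawing the full dendrogram.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def agglomerate(points, n_clusters):
    """Start with one cluster per sample and repeatedly merge the two
    closest clusters until n_clusters remain."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical standardized samples described by two parameters each.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 0.0)]
groups = agglomerate(points, 3)
```

Samples 0 and 1 merge first, then samples 2 and 3, leaving sample 4 as its own cluster. For datasets of realistic size, a library routine would normally replace this quadratic search.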
These findings showed that hierarchical agglomerative CA is essential due to its potential to categorize water samples into separate clusters based on surface water quality characteristics.

DA
DA was utilized to further assess spatial changes in surface water quality using the clusters produced by hierarchical agglomerative CA. The four clusters represent the dependent variables, while the 10 SWQPs were utilized as the independent variables. As shown in Fig. 5, DA was employed, and three discriminant functions were developed. The obtained clusters were clearly distinguished using function 1 and function 2. (August 2016)). The four temporal clusters were used as dependent variables, whereas the 10 SWQPs were utilized as independent variables. As shown in Fig. 6, the four seasonal clusters were categorized using the first two discriminant functions.

CONCLUSION
To better exploit water resources, it is essential to direct water treatment of waterbodies towards the most significant SWQPs. Hence, MSA was used to classify the key SWQPs in the SJR, diminish the complexity of water quality data, and assess spatiotemporal changes in surface water quality of the study site of the SJR. The key findings of this study are:
(1) Turbidity, TSS, COD, BOD, pH, and EC are the main SWQPs in the river.
(2) Hierarchical agglomerative CA grouped the 66 water sampling stations into four clusters, which means that a clear reduction in the complexity of the water quality dataset was accomplished.
(3) DA identified four seasonal groups (early spring, late spring, early summer, and late summer).