 Open Access
 Authors : Zhang Huaqiang, Meng Mengmeng, Ma Tong
 Paper ID : IJERTV2IS50417
 Volume & Issue : Volume 02, Issue 05 (May 2013)
 Published (First Online): 18-05-2013
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Abnormal Data Processing Of Heating Load Based On Density Estimation
Zhang Huaqiang, Meng Mengmeng, Ma Tong
Department of Electrical Engineering, Harbin Institute of Technology, Harbin 150001, China
Abstract
Owing to the limitations of modern measurement techniques and various sources of interference, Supervisory Control and Data Acquisition (SCADA) systems usually contain some abnormal data. Using such data without processing degrades the accuracy of data analysis and load forecasting, and can even cause serious mistakes in production scheduling. It is therefore necessary to identify anomalous data and correct them a second time. Conventional data processing approaches are limited because they mainly work in one-dimensional space. Exploiting the horizontal and vertical continuity of heating load data, this paper proposes a novel abnormal data processing algorithm based on density estimation that identifies and processes data in two-dimensional space. Numerous examples and simulation results show that the method is feasible.
Keywords: heating load; data preprocessing; two-dimensional space; density estimation; abnormal data

Introduction
Central heating has become widespread with the development of urban heat supply, but it also brings serious problems of energy shortage and environmental pollution. Modern heating systems must use heat sources effectively and keep the pipe network balanced, which helps to prevent pollution and reduce energy consumption [1][2]. Reliable data in the Supervisory Control and Data Acquisition (SCADA) system are the basis not only for heating load prediction and characteristic analysis, but also for regulating the balance of the load network. Abnormal data, continuous missing data and fluctuations caused by channel noise, impact loads and sudden accidents seriously affect load forecasting results. High-quality data acquisition and prediction results are important for the production scheduling of a heating system [3][4]. Therefore, the identification and secondary correction of abnormal data in SCADA are of great significance.
Domestic and foreign scholars have done a large amount of work on detecting and correcting abnormal data. References [5][6] analyze abnormal load data by applying the wavelet singularity detection principle. Reference [7] identifies abnormal load data by taking advantage of redundant data in the SCADA system; this method suits abnormal data caused by acquisition system faults, but it cannot handle abnormally fluctuating loads. In [8], an ART2 artificial neural network model identifies and adjusts anomalous data based on characteristic curves extracted in advance from classified data. However, each of these methods considers only the horizontal or only the vertical continuity of the load. In other words, they process the data in one-dimensional space and are limited to some extent. Considering the horizontal and vertical continuity of the load simultaneously, this paper presents a method that identifies and corrects abnormal data in two-dimensional space based on data density estimation. First, the load data series is converted into a two-dimensional data matrix whose columns and rows correspond to days and hours respectively [9]. Second, continuous missing data are identified over the whole data set, and the anomalous data are eliminated and adjusted. Finally, actual data provided by a heating power plant are used for prediction; numerous examples and simulation results indicate that the proposed method is feasible and effective.
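As a minimal sketch of the first step, the conversion of a load series into a two-dimensional matrix might look as follows. The function name `to_load_matrix` and the parameter `samples_per_day` are illustrative and do not come from the paper; NumPy is assumed.

```python
import numpy as np

def to_load_matrix(series, samples_per_day):
    """Arrange a 1-D load series as a matrix: rows = time of day, columns = days."""
    series = np.asarray(series, dtype=float)
    if series.size % samples_per_day != 0:
        raise ValueError("the series must contain a whole number of days")
    days = series.size // samples_per_day
    # reshape() yields one row per day; the transpose puts days in the columns.
    return series.reshape(days, samples_per_day).T

# Two days with three samples each (hypothetical values).
matrix = to_load_matrix([0, 1, 2, 3, 4, 5], samples_per_day=3)
print(matrix.shape)  # (3, 2)
```

In this layout, moving down a column follows consecutive samples within one day, while moving along a row compares the same time of day on consecutive days. These are the two directions of continuity that the method exploits.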
Identifying continuous missing data
Traditional identification methods rely on the viscous principle of heating load, which states that adjacent load data do not change abruptly. A threshold is set as the upper limit of the allowable variation between adjacent load data. When the absolute value of the difference between two adjacent points exceeds this threshold, the data are regarded as abnormal. The identification formula is:
$$\left| L_{d,t} - L_{d,t-1} \right| > \varepsilon \tag{1}$$
where $L_{d,t}$ is the load data at time $t$ of day $d$, and $\varepsilon$ is the set threshold. $L_{d,t}$ is classified as abnormal when formula (1) is satisfied. However, this approach has problems in dealing with continuous missing data. To solve them, the change rate of adjacent load points is taken as the identification criterion. The improved identification formula can be written as:
$$\begin{cases} \dfrac{\left| L_{d,t} - L_{d,t-1} \right|}{L_{d,t-1}} > \varepsilon, & t \neq 1 \\[2ex] \dfrac{\left| L_{d,t} - L_{d-1,24} \right|}{L_{d-1,24}} > \varepsilon, & t = 1 \end{cases} \tag{2}$$
The whole process should be carried out in chronological order. Once abnormal data are detected they should be corrected immediately, and the corrected value is then compared with the next data point. The correction formula, based on weighted averaging, is:
$$L_{d,t} = \omega_1 L_{d-1,t} + \omega_2 L_{d-2,t} + \dots + \omega_m L_{d-m,t} \tag{3}$$
where $L_{d-m,t}$ is the load data at time $t$ of day $d-m$, and $\omega_m$ is the weight coefficient that reflects the influence of $L_{d-m,t}$ on $L_{d,t}$. It is defined as follows. If $m > 1$, then
$$\omega_j = \alpha (1-\alpha)^{j-1}, \quad \alpha \in (0,1), \quad j = 1,2,\dots,m-1 \tag{4}$$
where $\sum_{j=1}^{m} \omega_j = 1$ and $\omega_j \ge \omega_{j+1}$, so that $\omega_m = 1 - \sum_{j=1}^{m-1} \omega_j$. Otherwise $m = 1$, and
$$\omega_1 = \alpha, \quad \alpha \in (0,1] \tag{5}$$
where $\alpha$ is the smooth coefficient and $t = 1,2,\dots,24$.
The corrected $L_{d,t}$ is the sum of the historical data at time $t$ multiplied by their respective weight coefficients.
Compared with the traditional threshold method, the improved method not only identifies continuous missing data effectively by formula (2), but also avoids the misjudgment caused by its reference value [10].


Main title

Method Description
The basic principle of the data density estimation method is as follows. Suppose a collected two-dimensional data set Z consists of M data points (the dots in Fig. 1). Generate a seed group S (the circles in Fig. 1) that contains N seeds. The spacing between seeds should be constant, and the scope of the seed group must be large enough to cover the data set Z. Each data point $z_j$ ($j = 1,2,\dots,M$) is accompanied by a seed adsorption counter $c_j$ whose initial value is zero. The counter sums the number of seeds adsorbed by that point, and its value is obtained by calculating the distances between data points and seeds.

Figure 1. Simplified data density scheme (horizontal axis: time; vertical axis: the original data values)

More specifically, calculate the distance between each seed $s_i$ ($i = 1,2,\dots,N$) and each data point in set Z. Assume that $z_k$ is the closest point to seed $s_i$; the index $k$ of this neighbouring data point is determined by the following formula:
$$k = \arg\min_{j} \left\| s_i - z_j \right\|, \quad i = 1,2,\dots,N, \quad j = 1,2,\dots,M \tag{6}$$
where $\|\cdot\|$ stands for the Euclidean distance and arg is the abbreviation of argument. Formula (6) means that $k$ is the value of $j$ at which the minimum is attained. Once $z_k$, the closest point to seed $s_i$, has been found by formula (6), the seed adsorption counter $c_k$ attached to that data point is increased by one. If $p$ data points share the same nearest distance to seed $s_i$, the increment is distributed equally among them. In other words, the seed adsorption counter of each of these data points is increased by $1/p$. The closest neighbour is found by formula (6) for each seed in the seed group S, and the seed adsorption counters are updated by the above rules until all seeds have been processed.
The basic principle of detecting abnormal data is as follows. A higher value of the seed adsorption counter indicates that the corresponding data point attracts more seeds. This means that few data points compete with it in its neighbourhood, so it is a datum with low density. Conversely, if a data point has many other points nearby, these points compete fiercely for seeds, and the counter value attached to each of them is correspondingly lower. Therefore, data whose seed adsorption counter value is higher than a set value can be classified as abnormal; this set value is called the seed adsorption threshold [11].
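The counting rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `seed_adsorption_counts`, the one-spacing margin around the data, the use of `np.isclose` to detect ties, and the toy data and threshold are all assumptions made for the example.

```python
import numpy as np

def seed_adsorption_counts(points, spacing):
    """Return one seed adsorption counter per data point."""
    points = np.asarray(points, dtype=float)
    # Seed grid with constant spacing, extended one spacing beyond the data
    # in every dimension so that it covers the whole data set.
    lo = points.min(axis=0) - spacing
    hi = points.max(axis=0) + spacing
    axes = [np.arange(a, b + spacing / 2, spacing) for a, b in zip(lo, hi)]
    seeds = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    seeds = seeds.reshape(-1, points.shape[1])

    counts = np.zeros(len(points))
    for seed in seeds:
        dist = np.linalg.norm(points - seed, axis=1)  # Euclidean distances
        nearest = np.isclose(dist, dist.min())        # all points tied for nearest
        counts[nearest] += 1.0 / nearest.sum()        # share the increment as 1/p
    return counts

# Toy data: eleven points on a line plus one isolated point at (0.5, 1.0).
data = [(x / 10, 0.0) for x in range(11)] + [(0.5, 1.0)]
counts = seed_adsorption_counts(data, spacing=0.05)
flagged = counts > 100  # illustrative seed adsorption threshold
```

Each seed contributes a total of one count, so the counters always sum to the number of seeds. In this toy set only the isolated point exceeds the illustrative threshold, because it has no neighbours to compete with for the seeds around it.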

Parameter Setting
The algorithm requires two parameters: the number of seeds N and the seed adsorption threshold. A simple heuristic method is used to determine the number of seeds:

1. Calculate the shortest distance between each data point $z_i$ and the other data points:
$$d_i = \min_{j} \left\| z_i - z_j \right\|_2, \quad i, j = 1,2,\dots,M \text{ and } i \neq j \tag{7}$$

2. Compute the mean value of the shortest distances among data points according to equation (8), and take it as the distance between neighbouring seeds:
$$\bar{d} = \frac{1}{M} \sum_{i=1}^{M} d_i \tag{8}$$

3. Determine the scope of the seeds so that they cover all data points. Assuming that one dimension of the data set ranges from $z_{\min}$ to $z_{\max}$, the upper boundary $s_{\max}$ and the lower boundary $s_{\min}$ of the seed set in this dimension should meet:
$$s_{\max} - z_{\max} \ge \bar{d}, \qquad z_{\min} - s_{\min} \ge \bar{d} \tag{9}$$

4. Calculate the number of seeds once the seed scope and spacing have been determined.

The seed adsorption threshold can be determined from the overall distribution of the seed adsorption counter values. First, sort the counter values in ascending order. Second, set a seed adsorption threshold; if a counter value is higher than the threshold, the corresponding data are modified according to equation (3). To obtain better results, the threshold can be adjusted flexibly depending on the specific circumstances [12].

Procedure Based On Density Estimation
The abnormal data processing approach based on density estimation can be summarized as:

1. Convert the load sequence into a two-dimensional data matrix;
2. Determine the number of seeds N;
3. Generate a seed group S with constant spacing;
4. Set the initial value $c_k$ of the seed adsorption counter attached to each data point $z_k$ to zero;
5. Calculate the distance between the seed $s_i$ and all data points, search for the nearest data point $z_k$ of the seed $s_i$, and update its seed adsorption counter $c_k$;
6. Repeat step 5 until all seeds have been processed;
7. Determine the seed adsorption threshold;
8. Identify the abnormal data and revise them according to equation (3).

The processing flowchart is shown in Fig. 2.

Figure 2. Flowchart of abnormal data processing (Start; determine the seeds number; generate a constant-spacing seed group; density estimation: initialize the seed adsorption counters, determine the nearest points of each seed and update the seed adsorption counters, repeat until all seeds are processed; determine the seed adsorption threshold; identify and revise abnormal data; End)


Simulation Analysis
The sample data were obtained from a heating power plant and comprise 91 days of heating load data, from September 12, 2012 to December 11, 2012. According to the results of calculation, $\bar{d} = 0.00094$, $z_{\min} = 0.06$ and $z_{\max} = 2.57$. Taking $\bar{d} \approx 0.001$, formula (9) gives $s_{\max} = 2.571$ and $s_{\min} = 0.059$. The three-dimensional density estimation graph is shown in Fig. 3, where the abscissa is the sampling time within one day at an interval of 30 minutes, the ordinate is the day number, and the vertical coordinate is the seed adsorption counter value corresponding to each data point.
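The parameter heuristic of formulas (7) to (9) can be sketched as follows. The function names are illustrative, NumPy is assumed, and the seed bounds are taken exactly one seed spacing beyond the data range, which is an assumption that happens to reproduce the values quoted above. The pairwise distance matrix needs memory proportional to M squared, which is acceptable for a few thousand points.

```python
import numpy as np

def nearest_neighbour_distances(points):
    """Formula (7): distance from each point to its closest other point."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude the case i == j
    return dist.min(axis=1)

def seed_spacing(points):
    """Formula (8): mean of the nearest-neighbour distances."""
    return nearest_neighbour_distances(points).mean()

def seed_bounds(z_min, z_max, spacing):
    """Tightest seed range allowed by formula (9) in one dimension."""
    return z_min - spacing, z_max + spacing

# Values quoted in the simulation: z_min = 0.06, z_max = 2.57, spacing ~ 0.001.
s_min, s_max = seed_bounds(0.06, 2.57, 0.001)
print(round(s_min, 3), round(s_max, 3))  # 0.059 2.571
```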
Figure 3. The density estimation of 3D graph (axes: time, day, and the value of the seed adsorption counter)

Because the ratio of abnormal data in short-term load forecasting does not exceed 3%, the smooth coefficient is taken as 0.5. According to the "the closer, the bigger" principle and the constraint on the weight coefficients, define the weight coefficients $\omega_1 = 0.5$, $\omega_2 = 0.25$ and $\omega_3 = 0.25$. Combined with the improved method, the seed adsorption threshold is set to 500, and 76 abnormal data points are identified, accounting for about 1.74% of the total data set. The following analysis shows the effect of abnormal data processing from two aspects.

Data Identification
After normalization, the two-dimensional data matrix is transformed into a new matrix whose elements lie between 0 and 1. Because the normalized matrix is similar to a grey image matrix, the processing of continuous missing data can be shown more intuitively as a grey image. The part of the data that contains continuous missing data is therefore converted into grey images, shown in Fig. 4. After effective treatment, a good identification result is achieved and the noise in the image is visibly reduced.

Figure 4. Comparison in grey image (panels: grey image of unhandled data; grey image of traditional method; grey image of improved method)

When the heating load is processed in the horizontal or the vertical direction alone, the traditional threshold method identifies 35 and 47 data points respectively, of which 16 are identified in both directions. Hence the horizontal and vertical threshold methods detect 66 abnormal data points in total. By comparison, the density estimation algorithm with a threshold of 500 identifies 76 abnormal points, 55 of which coincide with those found by the traditional threshold method. The comparison of the two methods is given in Tab. 1. At a similar abnormal data rate, the method based on density estimation is simpler, more feasible and more effective.

Table 1. Comparison results of different methods
| Algorithm | Traditional threshold method, horizontal processing only | Traditional threshold method, vertical processing only | Density estimation |
|---|---|---|---|
| Threshold value | 0.98 | 1.25 | 500 |
| Number of abnormal data | 35 | 47 | 76 |
| The ratio of abnormal data/% | 1.51 (both directions together) | | 1.74 |
| Overlapping | 55 (common to both methods) | | |
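The counts and ratios quoted in this section can be cross-checked with a few lines of arithmetic. The sketch assumes 48 samples per day, inferred from the 30-minute sampling interval and the 91-day data set; this figure is not stated explicitly in the paper.

```python
DAYS = 91
SAMPLES_PER_DAY = 48  # assumption: one sample every 30 minutes
total = DAYS * SAMPLES_PER_DAY

horizontal, vertical, both_directions = 35, 47, 16
# Points found in both directions are counted once (size of the union).
traditional = horizontal + vertical - both_directions
density = 76

def ratio_percent(n):
    """Share of the whole data set, in percent, rounded to two decimals."""
    return round(100 * n / total, 2)

print(total, traditional)          # 4368 66
print(ratio_percent(traditional))  # 1.51
print(ratio_percent(density))      # 1.74
```

Under this assumption both ratios in Table 1 are reproduced, which suggests that the 1.51% figure refers to the 66 points found by horizontal and vertical processing together.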

Evaluation By Accuracy Rate Of Daily Load Prediction
A central heating system is a complicated multivariable control system. Its large heating area, many influencing factors, strong internal correlation, long time delay and serious nonlinearity make load forecasting difficult. However, the Radial Basis Function (RBF) neural network, which has been widely used in time series analysis and nonlinear control, can handle arbitrary nonlinear functions. Therefore, an RBF neural network is used to forecast the load data for both the traditional and the improved methods. The accuracy rate of daily load prediction is taken as the evaluation index of the prediction effect; it is defined as [15]:
$$A = \left( 1 - \sqrt{\frac{1}{24} \sum_{i=1}^{24} E_i^2} \right) \times 100\% \tag{10}$$
where $E_i$ is the relative error at time $i$ of the forecasting day, and $A$ is the accuracy rate of daily load prediction.
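A minimal sketch of formula (10) follows, assuming it takes the usual root-mean-square form, that is, one minus the square root of the mean squared relative error, expressed in percent. The function name `daily_accuracy` and the sample values are illustrative. The relative error is undefined when the actual load is zero, so the sketch rejects such input.

```python
import math

def daily_accuracy(actual, forecast):
    """Accuracy rate of daily load prediction, in percent."""
    if len(actual) == 0 or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equally long")
    if any(a == 0 for a in actual):
        raise ValueError("relative error is undefined for a zero load")
    # Relative error E_i at each sampling time of the forecasting day.
    errors = [(f - a) / a for a, f in zip(actual, forecast)]
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    return (1.0 - rms) * 100.0

# Hypothetical day: every forecast is 2% above the actual load.
accuracy = daily_accuracy([100.0] * 24, [102.0] * 24)
```

With a constant 2% relative error the root-mean-square error is also 2%, so the accuracy is 98% up to floating-point rounding.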
Load forecasting results for one week are shown in Tab. 2 ("–" marks entries for which the data set contains zero values, so the relative error cannot be quantified). The proposed identification method based on data density estimation identifies continuous missing data well. The average prediction accuracy improves by 1.73%, and the prediction effect is superior to that of the traditional processing method based on one-dimensional space.
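The summary row of Tab. 2 can be cross-checked from the daily values. The sketch below only re-computes figures reported in the table; the variable names are illustrative.

```python
# Daily accuracy values from Tab. 2, in percent. The traditional method has
# no value for Wednesday and Thursday.
traditional = {"Monday": 97.55, "Tuesday": 96.36, "Friday": 96.98,
               "Saturday": 95.95, "Sunday": 96.63}
improved = {"Monday": 98.49, "Tuesday": 98.72, "Wednesday": 99.47,
            "Thursday": 97.78, "Friday": 98.25, "Saturday": 98.42,
            "Sunday": 97.84}

mean_traditional = sum(traditional.values()) / len(traditional)
mean_improved = sum(improved.values()) / len(improved)
print(round(mean_traditional, 3))                  # 96.694
print(round(mean_improved, 3))                     # 98.424
print(round(mean_improved - mean_traditional, 2))  # 1.73

# Mean improvement restricted to the days on which both methods have a value.
paired = [improved[day] - traditional[day] for day in traditional]
print(round(sum(paired) / len(paired), 2))         # 1.65
```

The reported 1.73% is the difference between the two column means, each taken over the days available for that method (five days for the traditional method, seven for the improved one). Restricted to the five days on which both methods have a value, the mean improvement is 1.65%.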
Table 2. Comparison results of load forecasting accuracy
| Time | Daily load accuracy/%, traditional threshold method | Daily load accuracy/%, improved method | Accuracy improved by/% |
|---|---|---|---|
| Monday | 97.55 | 98.49 | 0.94 |
| Tuesday | 96.36 | 98.72 | 2.36 |
| Wednesday | – | 99.47 | – |
| Thursday | – | 97.78 | – |
| Friday | 96.98 | 98.25 | 1.27 |
| Saturday | 95.95 | 98.42 | 2.47 |
| Sunday | 96.63 | 97.84 | 1.21 |
| Mean value | 96.694 | 98.424 | 1.73 |


Conclusion
Considering the horizontal and vertical continuity of heating load, a novel algorithm is put forward that detects and corrects anomalous data in two-dimensional space based on data density estimation. It avoids the deficiencies of processing in one-dimensional space by eliminating abnormal points from the overall data and correcting them a second time. The average prediction accuracy is improved by 1.73%. The example analysis and simulation results verify that the anomalous data identification method based on data density estimation is more feasible and effective than traditional methods.

References

[1] Wang Lei, Zhang Ruiqing and Sheng Wei. Regression forecast and abnormal data detection based on support vector regression, Conference, Proceedings of the CSEE, 2009, 29(8): 92-96.

[2] Song Yongqi. Study on new measuring and control device in household metering heating system, Dissertation, Harbin: Master's thesis of Harbin Institute of Technology, 2010: 118.

[3] Pang Qiang, Yuan Mingzhe, Zou Tao. Online rectification method of flow measurement error in steam pipe network and its application, Journal, Chinese Journal of Scientific Instrument, 2013, 34(1): 46-50.

[4] Zhang Xiaolei, Zhang Yanyan, Tang Lixin. Steam allocation plan considering production and electricity generation, Journal, Control Engineering, 2012, 19(6): 997-1002.

[5] Niu Dongxiao. Echo state network with wavelet in load forecasting, Journal, Emerald Journal, 2012, 41(10): 1557-1570.

[6] Gao Shan, Shan Yunda. A new method of load data error correction, Conference, Proceedings of the CSEE, 2001, 21(11): 105-108.

[7] Ye Feng, He Hua, Gu Quan. Bad data identification and correction for load forecasting in energy management system, Journal, Automation of Electric Power Systems, 2006, 30(15): 85-88.

[8] Gu Min, Ge Liangquan, Qin Jian. Identification and justification of dirty electric load data based on modified ART2 network, Journal, Automation of Electric Power Systems, 2007, 31(16): 70-74.

[9] Tong Shulin, Wen Fushuan, Chen Liang. A two dimension wavelet threshold denoising method for electric load data processing, Journal, Automation of Electric Power Systems, 2012, 36(2): 101-103.

[10] Li Guangzhen, Liu Wenying, Yun Huizhou. A new data preprocessing method for bus load forecasting, Journal, Power System Technology, 2010, 2(34): 150-151.

[11] Wang Yang. A novel algorithm for outlier removal based on density, Journal, ACTA AUTOMATICA SINICA, 2010, 36(2): 333-346.

[12] Chen Liang, Wen Fushuan, Tong Shulin. A method to identify and correct the abnormal electric load data based on density evaluation, Journal, Journal of South China University of Technology (Natural Science Edition), 2012, 40(2): 124-129.

[13] Zhan Tengxi, Guo Guanqi. Intelligent hybrid prediction method of the flue gas oxygen content in power plant, Journal, Chinese Journal of Scientific Instrument, 2010, 31(8): 1826-1833.

[14] Liu Yanwei. Research on the heating system load forecasting based on natural network, Dissertation, Tianjin: Master's thesis of Tianjin University, 2009: 16-25.

[15] Li Ruqi, Chu Jinshen, Xie Linfeng. Application of IAFSA-RBF neutral network to short-term load forecasting, Conference, Proceedings of the CSUEPSA, 2011, 23(2): 142-147.