 Open Access
 Total Downloads : 11
 Authors : I. Nandhini, Dr. K. Amshakala
 Paper ID : IJERTCONV5IS17022
 Volume & Issue : RTICCT – 2017 (Volume 5 – Issue 17)
 Published (First Online): 24042018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Distributed Maximum Likelihood Principal Component Analysis for Wireless Sensor Network Data
I. Nandhini[1],
PG Scholar,
Department of Computer Science and Engineering, Coimbatore Institute of Technology,
Coimbatore.
Dr. K. Amshakala[2],
Assistant Professor,
Department of Computer Science and Engineering, Coimbatore Institute of Technology,
Coimbatore.
Abstract: Principal Component Analysis is a data dimensionality reduction technique well suited for processing data from sensor networks and also is an effective anomaly detection technique. In an environment where the anomalies are present in the dataset, the derived principal components can be misleading by the anomalies .The anomalies can be detected by analyzing the data collected from the wireless sensor network across the environment. In this paper, a distributed maximum likelihood PCA algorithm is proposed that is more efficient in finding the principal components from the data containing anomalies. The algorithm uses maximum likelihood functions to find principal components locally and then compares it with principal components computed across the network to identify the anomalies.
Keywords Principal Component Analysis, Distributed Maximum likelihood PCA, Anomalies.
1 INTRODUCTION
Efficient innetwork data processing is a key factor for enabling wireless sensor networks (WSN) to extract useful information and an increasing amount of research has been devoted to the development of data processing techniques. Wireless sensors have limited resource constraints in terms of energy, network data throughput and computational power.

Motivation
Anomaly detection, also known as outlier detection, is a machine learning problem. An anomaly is defined by Barnett and Lewis as an observation (or subset of observations) which appears to be inconsistent with the remainder of the data[1] Anomaly detection aims to identify data that do not conform to the patterns exhibited by the data set.[2] Methods often use an unsupervised one class classification approach. The problem thus has two important characteristics, the data are not labeled and there is a class imbalance in the training set where the number of normal data significantly exceeds the number of anomaly data. The nature of sensor, peer to peer and adhoc wireless networks requires a distributed learning approach, as it is infeasible to communicate all data to a centralized node for computation. There are several reasons why data might be
in different physical locations. It is too costly to transfer the data to one physical location. Examples include limited energy resources, such as in Wireless Sensor Network (WSN)s, and limited time resources, such as in network intrusion(anomalies).

Contribution
In this paper, a distributed anomaly detection scheme based on the principal component analysis (PCA) and the maximum likelihood function is proposed. The approach addresses the challenge of performing anomaly detection in a wireless sensor networks.

A distributed version of PCA based upon maximum likelihood function. This improves on the performance of classical PCA.

A detailed evaluation of anomaly detection in a distributed environment is provided.


RELATED WORKS
There are two approaches to learning in a distributed environment. The first assumes a structure to the network,
i.e. partially distributed learning and another approach is to make no assumptions on the structure of the network, i.e. fully distributed.[3]Large numbers of limited resource sensor nodes operate autonomously to collaborate and manage the wireless networks, through which critical raw data are collected and transmitted to the end users/decision makers. WSNs have been used in critical application scenarios, such as enemy target monitoring and tracking and fire detection system WSNs can be susceptible to anomalies due to cheap unreliable hardware and software, and unfavorable operating environment that can affect the network communication. These anomalies must be detected as they can cause failures in the network and hence affect quality of collected data.[4]Principal component analysis (PCA) is one of the most widely used multivariate techniques in statistics. It is commonly used to reduce the dimensionality of data in order to examine its covariance/correlation structure of a set of variables which gives optimal solution to the problem. PCA [5] is a spectral decomposition technique that has been shown to perform well as an anomaly detector (e.g.,[6], [7]). There are several methods that have been used to construct the
principal components (PC)s in a partially distributed environment, for example fusing data [6], and constructing PCs at a cluster head [8]. [9] Power iteration method in PCA, which closely matches the approach proposed in this article. Their algorithm aims at computing the principal eigenvector of the adjacency matrix of a peertopeer network, which leads to the ranking of the network components, in a way similar to the page rank algorithm. The underlying hypotheses of the distributed system are interestingly close to ours (each component has only access to one dimension the problem, and can communicate with components whose data is related).
Huang et al. [10] propose a distributed anomaly detection method that focuses on volume anomalies, unusual traffic load levels caused by worms, distributed denial of service attacks and so on. It is wellknown that PCA is extremely fragile in the presence of anomalies in the training data set and even a small number of anomalies can significantly alter the subspace generated [11], [12]. Various techniques have been proposed in order to overcome this issue. Multivariate trimming [13], [14] aims to remove the outliers before deriving the PCs from the clean training data set. [15] Propose a distributed anomaly detection approach for WSNs which only assumes that the network is strongly connected. Each node has a local data set with the aim of computing the set of the global topk anomalies, the scheme is generic in that it is suitable for all densitybased methods, except Local Outlier Factor (LOF). A fully distributed consensusbased approach for PCA is proposed by Macua et al. [16]. The networkwide covariance matrix is estimated through the use of a consensus averaging (CA) algorithm and an exchange of p Ã— p matrices. PCA is then performed on each node. The algorithm is shown to have guaranteed convergence using only communication with neighboring nodes.[17] A probabilistic faulty detector in wireless sensor networks to improve the battery life of sensor nodes which minimizes additional power burden to sensor batteries.[18] The MLPCA dealing with incorporation of correlated measurement errors in PCA using maximum likelihood estimation. [19] PPCA demonstrate how the principal axes of a set of observed data vectors may be determined through maximum likelihood estimation of parameters in a latent variable model closely related to factor analysis with iteratively given principal component subspaces. [20] Proposes novel based PCA for detecting anomalies in unsupervised training datasets by the measure of the difference of anomaly from normal instances in principal components.

DISTRIBUTED MAXIMUM LIKELIHOOD PCA
In order to define global anomalies in a local data set, it is a requirement to build a classifier on a local node that has been built using information from the local dataset in a network. In order to perform this, by anomaly detecting approach MLPCA. This echnique performs superior in the presence of anomalies in the training dataset.

Maximum Likelihood
Maximum likelihood is the estimation of probability of occurrence of likelihood which is maximum in a dataset.
n
L( , x) f (xi, )
i1
Parameter would be the value of that maximizes the probability that is Likelihood of data.

Proposed system
Before look into the pseudo code and formal analysis of algorithm. Let us first gain some preprocessing work of given dataset because the sensors absorbs temperature and voltage datas which are inversely propositional so the scale value of both data is different. Convert temperature values and voltage values into HADS scaled form. Export the results and estimate the maximum likelihood function, find PCA for the ML. Detect the anomalies from PCA values. Calculate precision and accuracy for performance assessment provides which sensor was fault. Compare the mean average PCA and maximum likelihood PCA which gives DMLPCA was better and highly suits for the given environment. The pseudo code is given below.
Pseudo code for DMLPCA
Input: Temperature, Voltage Output: X, Y

Preprocessing the values of Temperature and Voltage into scaled form

Export the result from step1

Compute Maximum likelihood

Compute PCA from the likelihood

Detect Anomalies from PCA values

Do byzantine agreement

Calculate precision and accuracy for performance assessment

Compare with classical PCA along with obtained mean
Fig 1.1 Pseudo code for DMLPCA
The flow for the Distributed maximum likelihood principal component analysis eventually performs same process described in the above pseudo code. At first it gets the input data and convert them into scaled form, (i.e.,HADS) apply the maximum likelihood function to the dataset, and compute PCA from the likelihood function data. Finally detect the anomalies from the PCs locally. Aggregate all the local anomalies to detect the global anomaly. The evaluations of the dataset, description of dataset and performance assessment are given below.
Flow diagram for DMLPCA
Input dataset
Data preprocess
Apply Maximum likelihood PCA
PC data
Detect anomalies from PCs locally
Produces anomaly free data globally from local data
Performance assessment


EVALUATION
In this section, evaluations on realworld data are presented to illustrate the performance of DMPCA and distributed MLPCA. The evaluation environment is varied in order to examine the behavior of the proposed algorithm in a broad range of settings. All algorithms are implemented in C#.

Evaluation Environment
The elements considered in the evaluation are wireless sensor network topology and data sets.

Dataset description
Experiments were carried out using a set of five days of temperature readings obtained from a 54 Mica2Dot sensor deployment at the Intel research laboratory at Berkeley. The readings were originally sampled every thirtyone seconds. A preprocessing stage where data was discretized in thirty second intervals was applied to the dataset. After preprocessing, the dataset contained a trace of 14400 temperature and voltage readings from 52different sensors, randomly chosen without replacement from the appropriate class of the data set. The testing set consists of an equal number of normal and anomaly samples. To form the data sets in a distributed environment, an equal number of data instances is randomly distributed across the sensors. Temperature over the whole set of data ranged from about 0_C to 35_C and voltage ranged from about 0_ V to35_V.

Performance Assessment
To examine performance, false positive rate (FPR) is the ratio of false positives and the true positive rate (TPR) is the ratio of true positives are calculated to find precision
and accuracy for anomaly measurements of both DMPCA and DMLPCA. To compare schemes, curves are generated by varying the number of faulty sensor nodes.

DMLPCA evaluation on real world data
In this section, the performance of DMLPCA is examined and compared with DMPCA anomaly detection methods. Realworld data sets are used to examine performance of DMLPCA. The algorithms, is used to determine the optimal value of the data.
In the local approach, the data are randomly distributed between the nodes in the network. Each node constructs a classifier from the data available on the sensor node. The same test data set is used across all nodes. For DMPCA mean of the local classifiers is noted and then the mean and standard deviation of the performance over the 10 iterations are recorded. For DMLPCA maximum likelihood of the local classifiers over the 10 iterations are recorded.

Visualization on a real data Set
Fig 1.2 Select data
Fig 1.3 Data preprocess
Fig 1.4 Anomaly detection from pcs
Fig 1.5 TP,FP,TN,FN calculation
Fig 1.6 Precision and accuracy


Comparision of DMPCA and DMLPCA
Fig 1.7 Comparison graph
1
Precision for voltage
0.8
0.6
0.4
0.2
0
Accuracy for temp
1
0.8
0.6
0.4
0.2
0
0 1 2 3 4
Number of faulty sensors
0 1 2 3 4
Number of faulty sensors
DMPCA
DMLPCA
DMPCA DMLPCA

CONCULSION
A robust PCAbased anomaly detection algorithm that operates in a distributed environment was proposed. Maximum likelihood PCA is able to determine the PCs more robustly in the presence of anomalies by constructing likelihood around the data that reduces the influences of anomalies in the training set. Evaluations on realworld data sets show the performance of DMLPCA exceeds that of DMPCA (Distributed Mean PCA) anomaly detection techniques. DMLPCA takes maximum likelihood function and follows byzantine agreement of N+1/2 non faulty nodes. The byzantine agreement problem requires all fault free processors to agree on a common value, even if some components are corrupt. Realworld data set shows distributed algorithm is able to increase the performance of DMLPCA than DMPCA.
REFERENCES

V. Barnett and T. Lewis, Outliers in Statistical Data. New York, NY, USA: Wiley, Apr. 1994.

V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM Comput. Surveys, vol. 41, no. 3, pp. 15:115:58, Jul.2009.

S. Rajasegarar, C. Leckie, and M. Palaniswami, Hyperspherical cluster based distributed anomaly detection in wireless sensor networks, J. Parallel Distrib. Comput., vol. 74, no. 1, pp. 1833 1847, 2014.

Robert Reris and J. Paul Brooks Principal Component Analysis and Optimization: A Tutorial 14th INFORMS Computing Society Conference Richmond, Virginia, January 11{13, 2015pp. 212

H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educational Psychol., vol. 24, pp. 417 441, 1933.

V. Chatzigiannakis and S. Papavassiliou, Diagnosing anomalies and identifying faulty nodes in sensor networks, IEEE Sens. J., vol. 7, no. 5, pp. 637645, May 2007.

A. Lakhina, M. Crovella, and C. Diot, Diagnosing networkwide traffic anomalies, ACM SIGCOMM Comput. Commun. Rev.,vol. 34, pp. 219230, 2004.

M. A. Livani and M. Abadi, Distributed PCAbased anomaly detecton in wireless sensor networks, in Proc. 5th Int. Conf. Internet Technol. Secured Trans., Nov. 2010, pp. 18.

L. Huang, X. Nguyen, M. Garofalakis, M. I. Jordan, A. Joseph, and N. Taft, Innetwork PCA and anomaly detection, in Proc. Adv. Neural Inform. Process. Syst., Dec. 2006, pp. 617624.

J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, Robust principal component analysis: Exact recovery of corrupted low rank matrices via convex optimization, in Proc. Adv. Neural Inform. Process. Syst., Dec. 2009, pp. 20802088.

E. J. Cand_es, X. Li, Y. Ma, and J. Wright, Robust principal component analysis? J. ACM, vol. 58, no. 3, pp. 137, May 2011.

D. M. Titterington, Estimation of correlation coefficients by ellipsoidal trimming, Appl. Statist., vol. 27, no. 3, pp. 227234, 1978.

R. Kwitt and U. Hofmann, Robust methods for unsupervised PCAbased anomaly detection, in Proc. IEEE/IST Workshop Monitoring, Attack Detection Mitigation, Sep. 2006, pp. 13.

J. W. Branch, C. Giannella, B. Szymanski, R. Wolff, and H. Kargupta, Innetwork outlier detection in wireless sensor networks, Knowl. Inform. Syst., vol. 34, no. 1, pp. 2354, 2013.

S. Macua, P. Belanovic, and S. Zazo, Consensusbased distributed principal component analysis in wireless sensor networks, in Proc. IEEE 11th Int. Workshop Signal Process. Adv. Wireless Commun., Jun. 2010, pp. 15.

YannAÂ¨el Le Borgne 1;?, Sylvain Raybaud2, and Gianluca Bontempi1 Distributed Principal Component Analysis for Wireless Sensor Networks Sensors 2008, 8, 48214850; DOI: 10.3390/s8084821

Bill C.P. Lau a, Eden W.M. Maa, Tommy W.S. Chow a,b, Probabilistic fault detector for Wireless Sensor Network B.C.P. Lau et al. / Expert Systems with Applications 41 (2014) 3703 3711.

Peter D. Wentzell,1 Darren T. Andrews,1 David C. Hamilton,2 klaas faber3 and Bruce R. KowalskI3 Maximum likelihood principal component analysis Journal of Chemometrics, Vol. 11, 339366 (1997).

Michael E. Tipping Christopher M. Bishop Probabilistic Principal Component Analysis Journal of the Royal Statistical Society, Series B, 61, Part 3, pp. 611622.

Intel Berkeley Research lab between February 28th and April 5th, 2004.