Open Access
Authors: Bhagyashri Karkhanis, Sanjay Sharma
Paper ID: IJERTV8IS090092
Volume & Issue: Volume 08, Issue 09 (September 2019)
Published (First Online): 23-09-2019
ISSN (Online): 2278-0181
Publisher Name: IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License
Outlier Detection in High Dimensional Data Streams to Detect Lower Subspace Outliers Effectively
Bhagyashri Karkhanis
Dept. of Computer Science and Engineering Oriental Institute Of Science and Technology Bhopal, India
Sanjay Sharma (Asst. Prof)
Dept. of Computer Science and Engineering Oriental College of Technology
Bhopal, India
Abstract: Continuous, real-time processing of vast amounts of rapidly changing data lies at the heart of many modern and emerging applications. Outlier detection aims to recognize the objects that deviate markedly from the general distribution of the data; such items may be flagged as suspicious because they arise from a different generating mechanism. Various algorithms already work well in this setting for discovering outlying points, and machine learning techniques continue to extend modern outlier detection methods into ever more demanding applications.
Index Terms: Outlier detection, k-nearest neighbours (kNN), local outlier factor (LOF), intrinsic dimension.

INTRODUCTION
As the number of connected devices grows, the velocity and volume of the data they produce and consume grow as well. This massive data is scattered across many physically distributed sites, incurring high communication costs. The identification of outliers (i.e., data objects that do not fit well into the general data distribution) is very important in many practical applications. Examples include detecting fraud in financial transaction data, identifying measurement errors in scientific data, or investigating unusual activity in statistical data. Much research on this unsupervised outlier detection problem approaches it through ensemble methods [3]. Ensemble techniques, i.e., merging the findings of individual learners into a combined, usually more reliable and more accurate result, are well established in the supervised context of classification and regression [1]. In unsupervised learning the theoretical underpinnings are less clear, but they can be drawn in analogy to the supervised context, as has been done for clustering ensembles [2]. Here we present two unsupervised algorithms to detect outlying observations whose aberrant behavior is hidden in lower-dimensional subspaces or cannot be identified with a single detector.
Most such applications arise in very high-dimensional domains. For instance, the credit card data set contains transaction records described by over 100 attributes [4]. To detect anomalous motion trajectories in surveillance videos, we have to deal with the very high representational dimensionality of pixel features across sequential video frames [5]. Because of the notorious "curse of dimensionality", most approaches proposed so far, which are explicitly or implicitly based on assessing differences in Euclidean distance between objects in the full-dimensional space, do not work efficiently. Conventional methods for detecting distance-based outliers [6] or density-based outliers [7] suffer from the high computational cost of nearest-neighbor search in high dimensions.
Typical application areas for data streams include network traffic, telecommunications data, financial market data, data from sensors that monitor the weather and environment, surveillance video, and so on. Outlier detection on stream data finds items, i.e., objects or points, that are uncharacteristic or anomalous with respect to the majority of items in the entire stream or in a horizon/window of it.
Detecting outliers means identifying the objects that significantly diverge from the common distribution of the data. Finding outlying points in large data streams is valuable in many research areas, such as the analysis and monitoring of network traffic data (e.g., connection-oriented records, web logs), wireless sensor networks, financial transactions, and so on.
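The windowed, distance-based scoring just described can be sketched as follows. This is only an illustrative implementation (not the algorithm of any cited paper), using the common choice of the k-th nearest-neighbour distance within the current window as the outlier score:

```python
import numpy as np

def knn_outlier_scores(window, k=3):
    """Score each point in a window by the distance to its k-th nearest
    neighbour; larger scores indicate more isolated (outlying) points."""
    X = np.asarray(window, dtype=float)
    # Pairwise Euclidean distances within the window.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # exclude self-distance
    # k-th smallest neighbour distance per point.
    return np.sort(dist, axis=1)[:, k - 1]

# Toy window: a tight cluster plus one isolated point.
window = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5]]
scores = knn_outlier_scores(window, k=2)
# The isolated last point receives the highest score.
```

In a streaming setting this function would simply be re-applied to each sliding window as new items arrive and old ones expire.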
Figure: Outlier Detection points
The motivation for this research is that outliers present in real data streams are often embedded in a few lower-dimensional subspaces; here, a subspace refers to a projection of the full data space in which the outlying points stand out. The existence of such projected outliers is explained by the fact that, as data dimensionality grows, data points tend to become increasingly equidistant from each other, so the outlierness of individual points becomes progressively weaker and thus harder to distinguish. It is therefore attractive to propose new techniques that address the shortcomings of these existing methods.

PROBLEM STATEMENT
Outlier detection on stream data finds items, i.e., objects or points, that are abnormal or irregular with respect to the majority of items in the entire data stream or in a horizon/window of it. The literature provides a collection of solutions that address these disadvantages, but several research issues remain open, such as the relationship between the numerous feature-reduction methods and the resulting classification accuracy. The main objective is to identify a set of features that best approximates the original data without degrading the classification result. A further problem is the computational cost of feature-reduction algorithms on incoming data, which calls for computationally efficient feature-reduction techniques that can run concurrently. Algorithms for finding outlying points are often founded on statistical modeling techniques, which may be either predictive or direct. Predictive techniques use labeled training sets to produce an outlier detection model, i.e., a model of the domain within which outliers fall, which is subsequently used to categorize new data objects. Direct techniques, by contrast, comprise deviation-, proximity-, statistical-clustering-, and density-based methods, in which no labeled training sets are involved; objects are classified as outliers through the measurement of statistical heuristics. Although typically more complex than predictive techniques, direct methods are less constrained, since detection does not depend on predefined models.

OUTLIER DETECTION IN LARGE AND HIGH DIMENSIONAL DATA
Traditionally, outliers can be detected by conducting hypothesis tests against an assumed distribution of the underlying process that produces the dataset. In many cases, however, the distribution is unknown. Moreover, while statistical tests can be generalized to multivariate tests for datasets with more than one attribute, multivariate statistical methods are only effective for a very small number of attributes. Therefore, distance-based outlier detection, which compares the distances between the points of a dataset, is preferred in practice for multidimensional data.
Breunig et al. [8] introduce the idea of local density-based outliers. An observation that deviates greatly from its neighbors with regard to its local density is considered to be an outlier, where density is measured via the k-nearest-neighbor distances of its neighbors. Even though a local outlier may not deviate from all the other observations, it may indeed signal interesting information. For example, a network activity may be considered exceptionally high depending on the nature of the network: a server dedicated to content streaming likely requires more bandwidth than an email server.
However, in the computation of local density-based outliers, finding the k nearest neighbors is costly. The local density-based method uses multidimensional index trees to speed up the k-nearest-neighbor computation. A multidimensional index tree is a hierarchy of bounding regions over the sets of points in its subtrees. As the number of dimensions increases, the number of possible bounding regions needed to contain the subsets explodes, so the performance of index trees deteriorates in high dimensions. One of our contributions in this work is a method that computes local density-based outliers efficiently. We observe that when the dataset is partitioned into multiple subspaces, local outliers remain prominent in some of these subspaces. In many cases, an outlier deviates greatly in only some subsets of the attributes. In multidimensional data, small per-attribute deviations accumulate, and when the number of attributes is large this accumulation can be significant; it can even exceed the full-space deviation of a point that shows up as a strong outlier. Therefore, such outliers can be overlooked in a high-dimensional dataset.
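The local density-based scoring of Breunig et al. [8] is available off the shelf; a minimal sketch using scikit-learn's LocalOutlierFactor (the library's implementation of LOF, not our own method) looks like this:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A dense cluster plus one isolated observation.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(50, 2)), [[4.0, 4.0]]])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)             # -1 marks outliers, +1 inliers
scores = -lof.negative_outlier_factor_  # larger value => more outlying

# The isolated point receives the largest LOF score and the -1 label.
```

Internally the estimator performs exactly the costly k-nearest-neighbor search discussed above, which is why index structures (and their degradation in high dimensions) matter.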

LEARNING METHODS
Semi-supervised Methods: When training data is available either for normal observations or for outliers, but not both, semi-supervised methods are used to learn the boundary of the known class. For example, when the normal observations are known, an observation that falls outside the boundary of the normal observations is an outlier. The one-class support vector machine [9] is commonly used for semi-supervised outlier detection. The support vector machine (SVM) is a classification method that uses hyperplanes to partition a training dataset. Schölkopf et al. [9] apply the kernel method to transform the data such that points close to the origin are treated as a second class; the origin and the training data are then used to construct the decision boundary. The approach can also be generalized to classification with multiple classes by constructing a hyperplane for each class. If a new observation does not belong to any class, it is an outlier.
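A minimal sketch of this semi-supervised setting with scikit-learn's OneClassSVM (illustrative parameter values; gamma and nu would normally be tuned) is:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train only on "normal" observations (the semi-supervised setting).
rng = np.random.default_rng(1)
X_train = rng.normal(0, 1.0, size=(200, 2))

ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

# New observations: one near the training distribution, one far away.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])
pred = ocsvm.predict(X_new)   # +1 = inside the learned boundary, -1 = outlier
```

The nu parameter upper-bounds the fraction of training points treated as boundary violations, which is how the "boundary of normal observations" is controlled.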
Unsupervised Methods: In unsupervised methods, clustering is used to detect outliers. A clustering method groups similar observations according to some objective function. Compared with classification methods, clustering methods can group data without training datasets. They can be used to detect outliers by comparing the observations with the identified clusters: if an observation is far from all cluster centroids, it is declared an outlier. Although k-means is fast, it cannot detect clusters with different densities or shapes. In such cases we can use density-based clustering methods, such as the kernel method or SNN [10], to identify cluster centroids. There are several limitations to using clustering methods for outlier detection. The goal of clustering techniques is to detect clusters, not outliers, so they are not optimal for outlier detection. For instance, an outlier close to a centroid can still be an outlier. An outlier may be incorrectly accepted as a normal observation, whereas a normal observation may be flagged as an outlier.
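The centroid-distance heuristic described above can be sketched with k-means; this is a simple illustration of the idea (and of why it is suboptimal), not a method from any cited paper:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two well-separated clusters plus a single far-away point.
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               rng.normal(5, 0.3, (40, 2)),
               [[2.5, 10.0]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Outlier score: distance of each point to its assigned centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
# The stray point (index 80) has by far the largest distance.
```

Note the limitation stated above: a point lying close to a centroid but in an otherwise empty region would receive a small score here and be missed.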
Distance-based and Density-based: To overcome the limitations of statistical and clustering methods in outlier detection, Knorr et al. introduce a pruning strategy to speed up distance-based detection. The advantage of the method is that it can detect outliers without any assumption about the underlying distribution of the dataset. However, in some applications the outliers of interest may not be the farthest observations; an outlier can instead be characterized relative to its most similar observations. Consequently, Breunig et al. [8] introduce a density-based method that detects outliers with respect to their local densities: an observation that deviates strongly within its local region is considered an outlier. The method has proven useful in practice.

LITERATURE SURVEY
The general problem of identifying outliers has been addressed by various approaches that can be roughly classified as global versus local outlier models. A global outlier model leads to a binary decision of whether or not a given data object is an outlier. A local outlier model instead assigns a degree of outlierness to each object; this outlier factor is a value characterizing the extent to which each object is an outlier. In applications that rank the outliers in a database and retrieve the top-n outliers, a local outlier approach is clearly preferable. A further classification of outlier approaches distinguishes between supervised and unsupervised approaches.
In [11], the authors develop a learning method to separate outliers from normal observations. The idea is to use the local neighborhood information of an observation to determine whether it is an outlier. To capture the neighborhood information precisely, a quantity called the LPS is introduced to measure the anomalous degree of a suspicious observation. Formally, the LPS is consistent with the notion of the nuclear norm and can be obtained via low-rank matrix approximation. Furthermore, unlike existing distance-based and density-based detection methods, the proposed technique, LPOD, is robust to the parameter k of the kNN search embedded within it. The authors evaluate the algorithm on various outlier data sets; experimental results show that the LPS ranks the best candidates for outliers well and that LPOD performs competitively in many respects. Since LPOD uses kNN to obtain neighborhood information, its effectiveness relies on kNN, and its performance is influenced to some extent by the distance formulation used.
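The low-rank intuition behind the LPS can be illustrated with a much-simplified stand-in (this is not the LPOD algorithm of [11], only a sketch of the underlying idea): score each point by how poorly a rank-r SVD approximation of the data reconstructs it.

```python
import numpy as np

def low_rank_residual_scores(X, rank=1):
    """Illustrative low-rank outlier score: project the centered data onto
    its top `rank` singular directions and score each point by the
    reconstruction error. Points poorly captured by the dominant low-rank
    structure receive large scores."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    recon = (Xc @ Vt[:rank].T) @ Vt[:rank]   # rank-r reconstruction
    return np.linalg.norm(Xc - recon, axis=1)

# Points lying near the line y = x, plus one point far off that line.
X = np.array([[i, i + 0.01 * (-1) ** i] for i in range(20)] + [[10.0, -10.0]])
scores = low_rank_residual_scores(X, rank=1)
# The off-line point (index 20) receives the largest residual score.
```

The actual LPS of [11] is derived from low-rank approximation of kNN neighborhood matrices rather than of the whole data matrix, but the residual-from-low-rank-structure principle is the same.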
In [12], the authors introduce a new outlier scoring method, IDOS, that uses continuous intrinsic dimensionality (ID) to distinguish inliers from outliers within a local neighborhood. Continuous ID allows inliers that are members of a subspace cluster to be distinguished from non-members: local outliers tend to show an increase in the estimated value of their continuous intrinsic dimension, where the distance to a query point is modeled as a continuous random variable. The proposed IDOS score can be viewed as a reformulation of the well-known LOF outlier score in terms of this model of continuous intrinsic dimensionality. Compared with IDOS, LOF is shown to be more sensitive to variations of local density within local clusters (i.e., groups of inliers) than IDOS is in its measurement of local intrinsic dimensionality, which makes it harder for LOF to discriminate outliers in the neighborhood of such clusters. An experimental analysis shows that the accuracy of IDOS improves considerably over existing scoring-based outlier detection methods, especially when the data sets are large and high-dimensional, demonstrating its superiority in terms of both effectiveness and efficiency.
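The notion of local intrinsic dimensionality can be illustrated with the classical maximum-likelihood estimator of Levina and Bickel, computed from each point's k-nearest-neighbour distances. This is related to, but simpler than, the continuous ID model used by IDOS in [12]:

```python
import numpy as np

def local_intrinsic_dim(X, k=10):
    """Maximum-likelihood estimate of local intrinsic dimensionality
    (Levina-Bickel style) from each point's k nearest-neighbour distances:
    the inverse mean log-ratio of the k-th distance to the closer ones."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    knn = np.sort(dist, axis=1)[:, :k]        # k smallest neighbour distances
    logs = np.log(knn[:, -1:] / knn[:, :-1])
    return (k - 1) / logs.sum(axis=1)

rng = np.random.default_rng(3)
# Points on a 1-D line embedded in 3-D: local ID estimates cluster near 1.
t = rng.uniform(0, 10, 200)
X = np.c_[t, 2 * t, -t]
ids = local_intrinsic_dim(X, k=10)
```

In the IDOS setting, a local outlier sitting off such a manifold would exhibit a noticeably higher local ID estimate than the inliers on it.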
In [13], the authors propose a novel parameter-free approach, angle-based outlier detection (ABOD), together with several variants for evaluating data points. This approach alleviates the consequences of the curse of dimensionality when mining high-dimensional data, where distance-based approaches often fail to deliver high-quality results. Beyond the basic ABOD method, they propose two variants: FastABOD, an acceleration suitable for low-dimensional but large data sets, and LB-ABOD, a filter-refinement acceleration suitable also for high-dimensional data. A main advantage of the method is that it does not rely on any parameter choice influencing the quality of the resulting ranking, and it ranks the best candidate outliers with high precision and recall. An experimental evaluation compares ABOD to the well-established distance-based method LOF on a variety of artificial data sets and a real-life data set, and shows that ABOD performs especially well on high-dimensional data.
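The core ABOD idea, that an outlier sees all other points within a narrow angular range, can be sketched in a simplified unweighted form (full ABOD in [13] additionally weights the angles by inverse distances):

```python
import numpy as np
from itertools import combinations

def abod_scores(X):
    """Simplified angle-based outlier factor: for each point, the variance
    of the cosines of the angles spanned by all pairs of other points.
    Outliers see the remaining data within a narrow angular range, so their
    variance is SMALL; inliers see points all around them, so it is large."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        cosines = []
        for a, b in combinations(others, 2):
            u, v = X[a] - X[i], X[b] - X[i]
            cosines.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        scores[i] = np.var(cosines)
    return scores

# Cluster plus an outlier: the outlier has the smallest angle variance.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 2)), [[10.0, 10.0]]])
scores = abod_scores(X)
```

The O(n^3) cost of this naive version is precisely what the FastABOD and LB-ABOD variants are designed to reduce.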
As the dimensionality of data objects increases, it becomes difficult to find the data points that do not fit any group (cluster), i.e., the outliers. Outlier detection is important in real-life application areas such as fraud detection and intrusion detection, where data dimensionality keeps growing. In [14], the authors propose a method that divides the original high-dimensional data set into subspace clusters using a subspace clustering method; with an improved k-means algorithm, an outlier cluster is established, which is then merged with other clusters depending on an agreement criterion. Outlier clusters that cannot be merged form the final outlier cluster. The authors also survey research across high-dimensional data mining, information retrieval, ensemble subspace clustering for outliers in multidimensional data, spam detection, and improved k-means algorithms based on association rules. Since such data feeds information systems, all of these concepts can be used to improve data mining and machine learning methods, and they are helpful for designing robust information retrieval applications. One application is spam outlier detection using ensemble subspace clustering, in which spam outliers in an e-commerce review dataset can be detected: subspace clustering is performed, followed by outlier detection, and the results are then ensembled with other subspaces for higher accuracy. If spam detection logic is added on top, fraudulent reviews cease to be a concern. Clusters recognized as outlier clusters in high-dimensional data sets can be highlighted, or in some cases appropriate actions can be taken against the corresponding entities. Alternatively, elimination logic can be implemented so that, during data analysis, once outliers are detected, incoming data belonging to the same dimension set is rejected before being added to the database.
In [15], the authors propose a hybrid semi-supervised anomaly detection model for high-dimensional data. The detection model consists of two parts: a deep autoencoder (DAE) and a k-nearest-neighbor-graph (KNNG) based anomaly detector. The DAE benefits from its nonlinear mapping ability: it is first trained in unsupervised mode to extract the essential features of the data objects, transforming the high-dimensional data into a compact feature space. Because the distribution of the training data is denser in this compact feature space, the nonparametric KNN-based anomaly detector can operate on only a part of the training set rather than the whole of it, which greatly reduces the computational cost. Experimental results and statistical significance analysis on several real-life datasets confirm that the proposed hybrid model improves anomaly detection accuracy and reduces computational complexity compared with standalone algorithms.
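The compress-then-score structure of this hybrid model can be sketched with PCA standing in for the deep autoencoder (an assumption made purely to keep the sketch small; [15] uses a trained DAE) and a plain k-NN distance in the compact feature space as the detector:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 20))              # mixing matrix: 3 latent factors
train = rng.normal(size=(300, 3)) @ A     # "normal" data on a 3-D subspace

encoder = PCA(n_components=3).fit(train)  # stand-in for the DAE encoder
Z = encoder.transform(train)              # compact feature space
nn = NearestNeighbors(n_neighbors=5).fit(Z)

def anomaly_score(x):
    """Mean k-NN distance of the encoded point to the encoded training set."""
    z = encoder.transform(np.atleast_2d(x))
    d, _ = nn.kneighbors(z)
    return float(d.mean())

normal_score = anomaly_score(rng.normal(size=3) @ A)    # in-distribution
outlier_score = anomaly_score(10.0 * np.ones(3) @ A)    # extreme latent factors
```

Subsampling the encoded training set before building the neighbor index would reproduce the cost reduction that [15] reports for its KNNG detector.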

PROPOSED METHODOLOGY
Input: a data set D, integers k and n, a range-count threshold φ, and the derived fraction f^k = (1/φ)^k. Output: top-n outliers and a minimal number of interesting subspaces.
Step 1: Identify outliers in the full space.
Compute distance-based outliers, but use the ranks of distances instead of the absolute distances in the outlier decision.
For each object o, find its k nearest neighbors: nn1(o), ..., nnk(o).
The weight of object o is the sum of its k-nearest-neighbor distances: w(o) = dist(o, nn1(o)) + ... + dist(o, nnk(o)).
Rank all objects in descending order of weight.
Output the top-l objects by weight as outliers (l: user-specified parameter).
Use space-filling curves for approximation: scalable in both time and space with respect to data size and dimensionality.
Step 2: Dimensionality reduction.
This works only when, in the lower dimensionality, normal instances can still be distinguished from outliers.
Conventionally PCA is used: heuristically, the principal components with low variance are chosen, because on such dimensions normal objects are expected to lie close to each other, while outliers frequently deviate from the majority of classes (classes A, B, and C in our setting).
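The low-variance-component heuristic of Step 2 can be sketched as follows: score points by their coordinates along the lowest-variance principal directions, where normal objects stay near zero and outliers that break the data's main correlations stand out.

```python
import numpy as np

def minor_component_scores(X, m=1):
    """Score points by the magnitude of their coordinates along the m
    LOWEST-variance principal directions of the centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    minor = Vt[-m:]                      # directions of smallest variance
    return np.linalg.norm(Xc @ minor.T, axis=1)

# Strongly correlated data (y ~ 2x) plus one point violating the correlation.
rng = np.random.default_rng(6)
x = rng.normal(0, 1, 100)
X = np.vstack([np.c_[x, 2 * x + rng.normal(0, 0.05, 100)], [[0.0, 5.0]]])
scores = minor_component_scores(X, m=1)
# The correlation-breaking point (index 100) gets the largest score.
```

This is the standard PCA-based heuristic the step alludes to, not a complete implementation of the proposed pipeline.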
Step 3: Extending conventional outlier detection. Full-space detection is hard to use for outlier interpretation.
Step 4: Find outliers in much lower-dimensional subspaces, where it is easy to explain why, and to what extent, an object is an outlier.
E.g., find outlying data in a certain subspace by comparing averages: average of subspace-A classes >> average of subspace-B classes << average of subspace-C classes.
Calculate over (D, φ, k) to find all interesting subspaces.
Rank the reported interesting subspaces in a list L.
Step 5: A local outlier factor and Parzen-window method for outlier detection.
Project the records onto different subspaces to find a region whose density is much lower than average.
Discretize the data into a grid with equi-depth ranges (equi-depth so that each range holds the same fraction f = 1/φ of the data).
Search for regions that are considerably sparse.
Consider a k-d cube: k ranges on k dimensions, with n objects in total.
If objects are distributed independently, the expected number of objects falling into a k-dimensional cube is (1/φ)^k · n = f^k · n, and its standard deviation is sqrt(n · f^k · (1 − f^k)).
The sparsity coefficient of cube C is S(C) = (n(C) − n · f^k) / sqrt(n · f^k · (1 − f^k)), where n(C) is the number of objects in C.
If S(C) < 0, then C contains fewer objects than expected.
The more negative S(C) is, the sparser C is, and the more likely the objects in C are outliers in that subspace.
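Under the equi-depth grid assumptions of Step 5 (f = 1/φ per dimension, independence across the k chosen dimensions), the sparsity coefficient can be computed directly:

```python
import numpy as np

def sparsity_coefficient(n_in_cube, n, phi, k):
    """Sparsity coefficient of a k-dimensional cube built from equi-depth
    ranges (f = 1/phi per dimension):
        S(C) = (n(C) - n*f^k) / sqrt(n * f^k * (1 - f^k))."""
    f = 1.0 / phi
    expected = n * f ** k
    sigma = np.sqrt(n * f ** k * (1.0 - f ** k))
    return (n_in_cube - expected) / sigma

# 10 equi-depth ranges per dimension, a 3-D cube over 100000 points:
# expected occupancy is 100000 / 1000 = 100 points.
s_sparse = sparsity_coefficient(n_in_cube=20, n=100_000, phi=10, k=3)
s_normal = sparsity_coefficient(n_in_cube=100, n=100_000, phi=10, k=3)
# s_sparse is strongly negative; s_normal is (numerically) zero.
```

A cube with S(C) far below zero is a candidate subspace region whose occupants are reported as outliers.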


PROPOSED ARCHITECTURE
Figure: Overview of Learning Stage Outlier Detection

EXPERIMENTAL RESULTS
Selected outlier points and their class memberships:
second class, first class, second class, second class, second class, first class, third class, third class, first class, third class.
Per-feature outlier scores on the Iris attributes:
'Sepal Length'  'Sepal Width'  'Petal Length'  'Petal Width'
0.9586          0.8796         1               1
0.9586          0.8796         1               1
0.9586          0.1204         1               1
0.9172          0.7592         1               1
CONCLUSION
Outlier detection is an important machine learning task with many critical applications, such as medical diagnosis, fraud detection, and intrusion detection. Due to the huge number of data objects in real-life applications, outlier detection faces various challenges; accordingly, we reduce the dimensionality efficiently, enrich the representation of data objects, and combine traditional algorithms to build strong approximations for both high-dimensional and low-dimensional data. High-dimensional data can be seen as part of the variety challenge of big data: not only does the volume of data increase, but also its dimensionality; a large set of low-dimensional sensors can be seen as a high-dimensional multivariate time series.
We proposed the concept of anomalous patterns, in which the outliers follow certain patterns; anomalous patterns imply that the deviation may not just be random. Our experiments reveal that the top outliers are not always the interesting ones. We presented an adaptive dual-neighbor method to detect anomalous patterns based on their size and deviation. Our analysis indicates that the algorithm can discover interesting patterns that remain undetected by clustering or outlier detection methods.
REFERENCES

[1] L. Rokach, "Ensemble-based classifiers," Artificial Intelligence Review, 33:1-39, 2010.
[2] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.
[3] A. Zimek, R. J. G. B. Campello, and J. Sander, "Ensembles for unsupervised outlier detection: Challenges and research questions," SIGKDD Explorations, 15(1):11-22, 2013.
[4] R. Wheeler and J. S. Aitken, "Multiple algorithms for fraud detection," Knowledge-Based Systems, 13(2-3):93-99, 2000.
[5] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in Proceedings of CVPR'10, pages 1975-1981, 2010.
[6] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," in Proceedings of SIGMOD'00, pages 427-438, 2000.
[7] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast outlier detection using the local correlation integral," in Proceedings of ICDE'03, pages 315-326, 2003.
[8] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," SIGMOD Record, 29(2):93-104, 2000.
[9] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, 13(7):1443-1471, 2001.
[10] L. Ertöz, M. Steinbach, and V. Kumar, "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data," in Proceedings of the Third SIAM International Conference on Data Mining, pages 47-58, 2003.
[11] H. Liu, X. Li, J. Li, and S. Zhang, "Efficient outlier detection for high-dimensional data," IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2017.
[12] J. von Brunken, M. E. Houle, and A. Zimek, "Intrinsic dimensional outlier detection in high-dimensional data," NII Technical Report NII-2015-003E, Mar. 2015.
[13] H.-P. Kriegel, M. Schubert, and A. Zimek, "Angle-based outlier detection in high-dimensional data," in Proceedings of KDD'08, ACM, 2008.
[14] S. S. Kapare and B. A. Tidke, "Spam outlier detection in high dimensional data: Ensemble subspace clustering approach," International Journal of Computer Science and Information Technologies, 6(3):2326-2329, 2015.
[15] H. Song, Z. Jiang, A. Men, and B. Yang, "A hybrid semi-supervised anomaly detection model for high-dimensional data," Computational Intelligence and Neuroscience, 2017.