Outlier Detection Methods--- An Analysis

Pragyan Paramita Das; Maya Nayak

doi:10.17577/IJERTV2IS90377

Volume 02, Issue 09 (September 2013)

Paper ID : IJERTV2IS90377
Published (First Online): 16-09-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Pragyan Paramita Das, Maya Nayak

Asst. Professor, Dept. of I.T. Orissa Engineering College

Vice Principal and H.O.D., Dept. of I.T. Orissa Engineering College

Abstract

An outlier is an extreme observation that is considerably dissimilar from the rest of the objects in a dataset. Outlier detection is helpful in many applications such as data cleaning, network intrusion detection, credit card fraud detection, telecom fraud detection, customer segmentation and medical analysis. Outliers are mostly removed to improve the accuracy of predictions, but the presence of an outlier can also carry meaning of its own. In this work we compare outlier detection techniques based on statistical, density-based, distance-based and deviation-based methods.

Keywords

Outlier detection, statistical method, density based method, deviation based method, distance based method, artificial intelligence, fuzzy logic, neural network.

1 Introduction

In the literature, different definitions of an outlier are given:

An outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism (Hawkins, 1980).
An outlier is an observation which appears to be inconsistent with the remainder of the dataset (Barnett & Lewis, 1994).
An outlier in a set of data is an observation or a point that is considerably dissimilar or inconsistent with the remainder of the data (Ramaswamy et al., 2000).

Outlier analysis has assumed importance because an outlier often represents a critical piece of information in a data set. In credit card transactions an outlier may indicate credit card theft or misuse. In public health an outlier may indicate a new disease in a patient. Similarly, in a computer network it could indicate a hacked computer or a compromised system. Therefore, detecting outliers is an important data mining task.

Some of the prominent causes for outliers are mentioned below:
1. Human error: Outliers may exist due to a faulty reporting system or human error.
2. Environmental change: A new buying pattern amongst the customers or change in the nature of the environment itself can cause the presence of outliers.
3. Error in instrument: Defects in the instruments which are used to measure and report may be a cause for outliers.
4. Malicious activity: As mentioned above, credit card fraud or hacking into a network system may be a cause of outliers.
Type I Outliers.

In a particular dataset an individual outlying instance is termed a Type I outlier [1]. That single data point is an outlier because its attribute values are inconsistent with the values taken by normal instances. Many of the existing outlier detection schemes focus on this kind of single outlier. These techniques analyze the relationship of a single data point to the rest of the points in the dataset. For example, in credit card data or medical data each data instance corresponds to a single transaction or a single patient.

Type II Outliers.

These outliers are caused by the occurrence of an individual data instance in a specific context in the given data. Like Type I outliers, they are individual data instances, but a Type II outlier is defined with respect to its context, which depends on the structure of the dataset and on the problem formulation.

Type III Outliers.

These are not individual observations but an entire subset of the dataset that is anomalous. The occurrence of these instances together constitutes an anomaly. They are usually meaningful when the data has a sequential nature. These outliers are either subgraphs or subsets occurring in the data.

There are four basic methods for the detection of outliers. They are the statistical method, deviation method, density method and the distance method. Each of these methods is explained below in some detail.

2 Methods of Outlier Detection
Deviation-based techniques identify outliers by examining the main characteristics of objects in a group. Objects that deviate from this description are considered to be outliers. There are primarily two deviation-based techniques: the sequential exception technique and the OLAP data cube technique.

Sequential exception technique

This technique simulates the way in which humans distinguish unusual objects from a series of objects. It uses the implicit redundancy of the data. Given a data set $D$ of $n$ objects, it builds a sequence of subsets $\{D_1, D_2, \ldots, D_m\}$ with $2 \le m \le n$ such that $D_{j-1} \subset D_j$, where $D_j \subseteq D$. The technique works by selecting such a sequence of subsets from the set for analysis. For every subset, it determines the difference in dissimilarity between that subset and the preceding subset in the sequence.
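The idea of comparing each subset with its predecessor can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes one-dimensional numeric data, uses the population variance as a hypothetical dissimilarity function, and grows the subsets one element at a time in the given order. The function names are illustrative only.

```python
from statistics import pvariance


def dissimilarity(subset):
    """Hypothetical dissimilarity function: population variance of the subset."""
    return pvariance(subset) if len(subset) > 1 else 0.0


def sequential_exceptions(data):
    """Return (index, jump) pairs, one per element added to the growing subset.

    Nested subsets are built by appending one element at a time, and each
    jump is the change in dissimilarity relative to the preceding subset.
    """
    jumps = []
    previous = dissimilarity(data[:1])
    for j in range(2, len(data) + 1):
        current = dissimilarity(data[:j])
        jumps.append((j - 1, current - previous))
        previous = current
    return jumps


def most_exceptional(data):
    """Index of the element whose addition raises the dissimilarity the most."""
    index, _ = max(sequential_exceptions(data), key=lambda pair: pair[1])
    return index
```

Calling `most_exceptional([10, 11, 9, 10, 50, 10, 11])` picks out the position of the value 50, because adding it produces by far the largest increase in variance. The result of such a scheme depends on the order in which the objects are presented, so in practice the procedure would probably be repeated over several orderings.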

OLAP Data Cube technique

The OLAP approach uses data cubes to identify regions of anomalies or outliers in large multidimensional data. Exception detection is overlapped with cube computation. In this approach, precomputed measures indicating data exceptions are used to guide the user in data analysis. A cell value in the cube is considered to be an exception if it is significantly different from the expected value, which is based on a statistical model. The method uses the background colour of each cell to reflect its degree of exception. The measure value of a cell can also reflect exceptions occurring at more detailed or lower levels of the cube, where these exceptions are not visible from the current level.
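A minimal Python sketch of cell-level exception detection is given here for illustration. The text does not specify the statistical model, so the sketch assumes a simple additive model on a two-dimensional cube (expected value = row mean + column mean - grand mean) and a hypothetical threshold on the standardized residual. Both are assumptions, not the method of any particular OLAP system.

```python
from statistics import mean, pstdev


def cell_exceptions(cube, threshold=2.0):
    """Flag cells of a 2-D cube whose value departs from an additive model.

    Expected value of cell (i, j) = row mean + column mean - grand mean.
    Both the additive model and the threshold are assumptions of this sketch.
    """
    row_means = [mean(row) for row in cube]
    col_means = [mean(col) for col in zip(*cube)]
    grand = mean([value for row in cube for value in row])

    residuals = [
        [value - (row_means[i] + col_means[j] - grand) for j, value in enumerate(row)]
        for i, row in enumerate(cube)
    ]
    spread = pstdev([r for row in residuals for r in row])
    if spread == 0:
        return []  # the model fits every cell exactly; nothing to flag

    # A cell is an exception when its residual is large relative to the spread.
    return [
        (i, j)
        for i, row in enumerate(residuals)
        for j, r in enumerate(row)
        if abs(r) / spread > threshold
    ]
```

The ratio `abs(r) / spread` plays the role of the degree of exception that a front end could map to a background colour.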

3 Other Methods to detect outliers

New forms of outlier detection methods have been developed in recent times. One of them is a fuzzy logic-based method in which a fuzzy inference system (FIS) receives the following four features, computed for each pattern, as inputs:
Distance between each element and the centroid of the overall distribution, normalized with respect to the average value (dist).
Fraction of the total number of elements that are near to the pattern itself (n-points).
Mean distance between the considered pattern and the remaining patterns, normalized with respect to the maximum value (mean-dist).
Degree of membership of the pattern to the cluster to which it has been assigned by the preliminary fuzzy c-means clustering stage (memb-deg).

In geometry, the centroid of an object X in n-dimensional space is the intersection of all hyperplanes that divide X into two parts of equal moment about the hyperplane. Roughly speaking, the centroid is a sort of average of all points of X.
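For a finite set of points, the centroid reduces to the coordinate-wise mean, and the dist feature follows directly from it. The sketch below is illustrative only: it assumes Euclidean distance and points given as equal-length tuples, and the function names are hypothetical.

```python
from math import dist as euclidean
from statistics import mean


def centroid(points):
    """Coordinate-wise average of a list of equal-length tuples."""
    return tuple(mean(axis) for axis in zip(*points))


def dist_feature(points):
    """Distance of each point to the centroid, divided by the average distance."""
    centre = centroid(points)
    distances = [euclidean(p, centre) for p in points]
    average = mean(distances)
    if average == 0:
        return [0.0 for _ in points]  # all points coincide with the centroid
    return [d / average for d in distances]
```

A point far from the bulk of the data receives a dist value well above 1, while typical points stay close to or below 1.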

The FIS is of Mamdani type (Mamdani & Assilian, 1975) and the FIS output variable, named outlier-index (outindx), is defined in the range [0;1]. The output provides an indication of the risk that the considered pattern is an outlier. The inference rules relating the output variable to the four inputs are formulated as a set of 6 fuzzy rules, which are listed below:

IF (dist is very high) AND (n-points is very small) AND (memb-deg is low) AND (mean-dist is big) THEN (outindx is very high).
IF (dist is medium) AND (n-points is small) AND (memb-deg is quite low) AND (mean-dist is small) THEN (outindx is quite high).
IF (dist is low) AND (n-points is medium) AND (memb-deg is quite low) AND (mean-dist is very small) THEN (outindx is low).
IF (dist is medium) AND (n-points is very small) AND (memb-deg is quite low) AND (mean-dist is small) THEN (outindx is quite high).
IF (dist is low) AND (n-points is small) AND (memb-deg is high) AND (mean-dist is quite big) THEN (outindx is low).
IF (dist is low) AND (n-points is medium) AND (memb-deg is high) AND (mean-dist is small) THEN (outindx is low).
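The six rules above can be evaluated with a short Python sketch. The membership functions are not given in the text, so everything numeric here is a labelled assumption: all four inputs are taken to be rescaled to [0, 1], each linguistic term is a triangular function with a hypothetical peak, AND is the minimum operator, and the output is a weighted average of hypothetical output-term centres. That last step is a simplification and not the full Mamdani centroid defuzzification.

```python
def triangle(x, peak, width=0.35):
    """Triangular membership centred on `peak`; the shape is a hypothetical choice."""
    return max(0.0, 1.0 - abs(x - peak) / width)


# Hypothetical peak positions on [0, 1] for each linguistic term.
PEAKS = {
    "very small": 0.0, "small": 0.25, "low": 0.2, "quite low": 0.35,
    "medium": 0.5, "quite big": 0.65, "big": 0.8, "high": 0.8, "very high": 1.0,
}
# Hypothetical centres of the output terms of outindx.
OUTPUT_CENTRES = {"low": 0.1, "quite high": 0.7, "very high": 1.0}

# (dist, n-points, memb-deg, mean-dist) -> outindx, in the order listed in the text.
RULES = [
    (("very high", "very small", "low", "big"), "very high"),
    (("medium", "small", "quite low", "small"), "quite high"),
    (("low", "medium", "quite low", "very small"), "low"),
    (("medium", "very small", "quite low", "small"), "quite high"),
    (("low", "small", "high", "quite big"), "low"),
    (("low", "medium", "high", "small"), "low"),
]


def outlier_index(dist, n_points, memb_deg, mean_dist):
    """Weighted average of rule consequents; each weight is the rule's firing strength."""
    inputs = (dist, n_points, memb_deg, mean_dist)
    weighted, total = 0.0, 0.0
    for terms, consequent in RULES:
        # AND is modelled as the minimum of the antecedent memberships.
        strength = min(triangle(x, PEAKS[t]) for x, t in zip(inputs, terms))
        weighted += strength * OUTPUT_CENTRES[consequent]
        total += strength
    # If no rule fires, this sketch reports no evidence of an outlier.
    return weighted / total if total > 0 else 0.0
```

With these assumed shapes, a pattern that is far from the centroid, has almost no neighbours and a low cluster membership, such as `outlier_index(1.0, 0.0, 0.2, 0.8)`, fires only the first rule and scores higher than a typical pattern such as `outlier_index(0.2, 0.5, 0.8, 0.25)`.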

Artificial intelligence techniques

When dealing with industrial automation, where data coming from the production field are collected by different and heterogeneous means, the occurrence of outliers is more the rule than the exception. Standard outlier detection methods fail to detect outliers in industrial data because of the high dimensionality of the data. In these cases, the use of artificial intelligence techniques has received increasing attention in the scientific and industrial community, as these techniques have the advantage of requiring few or no a priori theoretical assumptions about the considered data. Moreover, their implementation is relatively simple and has no apparent limitation on the dimensionality of the data. [9]

4 Evaluation of outlier detection techniques

Evaluation of an outlier detection technique is very important to establish its usefulness in detecting outliers in a given data set. Moreover, several techniques require a number of parameters that need to be determined empirically. An evaluation metric is required in such cases to determine the best values for the parameters involved. A technique can be evaluated on a given data set as well as in an application domain.

Detecting outliers in a given data set

The objective of any outlier detection technique is to detect outliers in a data set. Evaluation for this purpose involves running the technique on a validation set and seeing how well it detects the outliers. A labelling type of outlier detection technique is typically evaluated using any of the evaluation techniques from the 2-class classification literature. [10] First of all, a benchmark data set is chosen. One of the primary requirements is that the outliers are labelled in the data set. The evaluation data is then split into training and testing data sets by using techniques such as hold-out, cross-validation or jack-knife estimation. The outlier detection technique is then applied to the test part of the validation data and the instances are labelled as outlier data points or normal data points. The predicted labels are then compared with the actual labels to construct a confusion matrix as given below:

                                 Actual
Predicted            Outlier     Normal instances
Outlier              Ot          Of
Normal instances     Nf          Nt

An evaluation metric is constructed from the four quantities of the confusion matrix obtained by running an outlier detection technique on the validation data. It is represented by $f(O_t, O_f, N_f, N_t, \Theta, C)$, where $\Theta$ represents the parameters associated with the outlier detection technique and $C$ is a cost matrix that assigns weights to each of the four quantities.

For scoring type of outlier detection techniques, there is typically a threshold parameter that determines the cut-off above which instances are treated as outliers. The threshold is either determined from the outlier scores or left for the user to choose. After applying this cut-off, the confusion matrix is constructed and the evaluation metric is computed for the given value of the threshold (incorporated in $\Theta$).
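The construction of the confusion matrix from outlier scores and a threshold can be sketched in Python as follows. The form of the metric is not fixed by the text, so `weighted_metric` assumes a simple cost-weighted sum of the four counts, and the cost values in the usage note are hypothetical.

```python
def confusion_matrix(scores, labels, threshold):
    """Count Ot, Of, Nf and Nt for instances scored above or below `threshold`.

    `labels` holds True for actual outliers and False for normal instances.
    """
    counts = {"Ot": 0, "Of": 0, "Nf": 0, "Nt": 0}
    for score, is_outlier in zip(scores, labels):
        predicted_outlier = score > threshold
        if predicted_outlier and is_outlier:
            counts["Ot"] += 1  # outlier correctly detected
        elif predicted_outlier:
            counts["Of"] += 1  # normal instance wrongly flagged
        elif is_outlier:
            counts["Nf"] += 1  # outlier missed
        else:
            counts["Nt"] += 1  # normal instance correctly passed
    return counts


def weighted_metric(counts, cost):
    """Hypothetical metric f: cost-weighted sum of the four counts (lower is better)."""
    return sum(cost[name] * counts[name] for name in counts)
```

For example, with `scores = [0.1, 0.9, 0.4, 0.8, 0.2]`, `labels = [False, True, True, False, False]` and a threshold of 0.5, one outlier is detected, one is missed and one normal instance is wrongly flagged. Sweeping the threshold and recomputing the metric is one way to choose the cut-off empirically.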

Parametric statistical techniques are typically evaluated on artificially generated data sets drawn from known distributions. Outliers are artificially injected into the data, ensuring that they do not belong to the distribution from which the rest of the samples are generated. Alternatively, data sets containing a rare class are chosen and the instances belonging to the rare class are treated as outliers. Another technique is to take a labelled data set and remove the instances of any one class. The reduced set then forms the normal data, and all or a few instances of the removed class are injected into the normal data as outliers.
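The first of these schemes, injecting outliers into artificially generated data, might look like the following Python sketch. The distributions, sample sizes and the function name `make_benchmark` are all hypothetical choices made for illustration.

```python
import random


def make_benchmark(n_normal=200, n_outliers=5, seed=0):
    """Draw normal samples from N(0, 1) and inject outliers far away from them.

    Returns a shuffled list of (value, is_outlier) pairs. The distributions
    and the gap between them are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    normal = [(rng.gauss(0.0, 1.0), False) for _ in range(n_normal)]
    # Outliers come from a different mechanism: magnitude uniform on [8, 12],
    # random sign, which is extremely unlikely under N(0, 1).
    outliers = [
        (rng.choice((-1, 1)) * rng.uniform(8.0, 12.0), True)
        for _ in range(n_outliers)
    ]
    data = normal + outliers
    rng.shuffle(data)
    return data
```

Because every instance carries its label, the resulting list can be passed directly to a labelled evaluation, and the fixed seed keeps the benchmark reproducible.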

However, most of the available benchmark data sets are adapted for evaluating Type I outlier detection techniques. No data sets are available to evaluate Type II outlier detection techniques, and for Type III outlier detection techniques there are no publicly available benchmark data sets that can be used to evaluate any such technique. Several outlier detection techniques are also evaluated against other objectives such as scalability, the ability to work with higher dimensional data sets and the ability to handle noise. The metrics described above are used for such evaluation; the only difference is the choice of data sets, which must capture the complexity being evaluated. [11]
Evaluation in application domain

Such evaluation is done by choosing a validation data set that represents a sample from the target application domain. The validation data should have labelled entities that are considered to be outliers in the domain. A key observation here is that such evaluation measures the performance of the entire outlier detection setting, which includes the technique, the features chosen and other related parameters and assumptions. Some of the popular application domains have benchmark data sets. Such benchmark data sets allow a standardized comparative evaluation of outlier detection techniques and hence are very useful. Often, however, the lack of such benchmark data sets has forced researchers to evaluate their techniques on proprietary or confidential data sets that are not publicly available. Sometimes labelled validation data is not available at all. In such cases a qualitative analysis is performed, which typically involves a domain expert. [12]
5 Conclusion

Outlier detection methods can be divided between univariate methods, which were proposed in the earlier works in this field, and multivariate methods, which form most of the current body of research. Another fundamental taxonomy of outlier detection methods is between parametric (statistical) methods and nonparametric methods that are model-free. Statistical parametric methods either assume a known underlying distribution of the observations or, at least, are based on statistical estimates of unknown distribution parameters. These methods flag as outliers those observations that deviate from the model assumptions. However, they are not effective for high dimensional data sets or for arbitrary data sets where there is no prior knowledge of the data distribution. Within the class of nonparametric outlier detection methods one can set apart the data-mining methods, also called distance-based methods. These methods are usually based on local distance measures and are capable of handling large databases. Another class of outlier detection methods is founded on clustering techniques, where clusters of small size can be considered as clustered outliers. Methods that identify both high and low density pattern clustering further partition this class into hard classifiers and soft classifiers.

Recent approaches such as artificial intelligence techniques, however, have the advantage of requiring no a priori assumption about the considered data. Newer methods such as the fuzzy logic-based method are reported to outperform the most widely adopted traditional methods. Future work on the FIS-based outlier detection strategy will concern optimizing the algorithm in order to improve its efficiency and its on-line implementation. Moreover, further tests have to be performed on different applications, which will help to improve the outlier detection capability.

References
1. Outlier Detection: A Survey by Varun Chandola, Arindam Banerjee and Vipin Kumar, University of Minnesota.
2. Outlier Analysis by Charu C. Aggarwal, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA.
3. A Meta Analysis Study of Outlier Detection Methods in Classification by Edgar Acuna and Caroline Rodriguez, Department of Mathematics, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico 00680.
4. Outlier Detection by Irad Ben-Gal, Department of Industrial Engineering, Tel Aviv University, Ramat-Aviv, Tel Aviv, Israel, 69978.
5. Outlier Detection Methods for Industrial Applications by Silvia Cateni, Valentina Colla and Marco Vannucci, Scuola Superiore Sant'Anna, Pisa, Italy.
6. Balance Sheet Outlier Detection Using a Graph Similarity Algorithm, University of Virginia; Stevens Institute of Technology, University of Virginia – Systems Engineering (2013).
7. J. Hartwig and J.-E. Sturm (2012): An outlier-robust extreme bounds analysis of the determinants of health-care expenditure growth, KOF Working Papers No. 307, June, Zurich.
8. T. de Vries, S. Chawla, M. E. Houle. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems (KAIS), 32(1): 25-52, 2012.
9. H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek. Outlier Detection in Arbitrarily Oriented Subspaces. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium: 379-388, 2012.
10. T. de Vries, S. Chawla, M. E. Houle. Finding Local Anomalies in Very High Dimensional Space. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), Sydney, Australia: 128-137, 2010.
11. H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek. Interpreting and Unifying Outlier Scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ: 13-24, 2011.
12. H. V. Nguyen, V. Gopalkrishnan, I. Assent. An Unbiased Distance-based Outlier Detection Approach for High-dimensional Data. In Proceedings of the 16th International Conference on Database Systems for Advanced Applications (DASFAA), Hong Kong, China: 138-152, 2011.
