 Open Access
 Total Downloads : 571
 Authors : Nishith Kumar , Mohammed Nasser
 Paper ID : IJERTV1IS6305
 Volume & Issue : Volume 01, Issue 06 (August 2012)
Published (First Online): 30-08-2012
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A New Graphical Multivariate Outlier Detection Technique Using Singular Value Decomposition
Nishith Kumar1 , Mohammed Nasser2
1Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh;
2Department of Statistics, Rajshahi University, Rajshahi, Bangladesh.
Abstract
It is well recognized that even a high-quality data set tends to contain a remarkable percentage of unusual observations. There are various techniques to detect multivariate outliers, but they depend heavily on mathematics. We propose a new graphical technique to detect multivariate outliers that is easy to understand without hard mathematics; it can be applied to data for both supervised and unsupervised learning, and it can be applied directly to separate extreme outliers from general outliers.

Keywords: Outliers, Singular Value Decomposition, Principal Component Analysis.

Introduction
The outlier detection problem is as old as statistics. Outliers are present in both supervised and unsupervised learning of multivariate data sets, and they can destroy an analysis. So outlier detection should be the first target of a statistician or researcher. In this paper we propose a method for outlier detection in multivariate data sets and apply it to several well-known data sets.

In multivariate statistics we detect outliers by the Mahalanobis distance, but it is not suitable for outlier detection since it depends on the non-robust mean and covariance matrix. To detect outliers, Rousseeuw and Leroy (1987) proposed robust distances, which are robustified versions of the Mahalanobis distances:

RD_i = sqrt( (x_i - t_n)^T C_n^{-1} (x_i - t_n) ),

with (t_n, C_n) robust estimates of location and scatter. Observations with RD_i bigger than the critical value sqrt( chi^2_{k, 0.975} ) can be considered as potential outliers (Rousseeuw and Van Zomeren, 1990).

Regression analysis is one of the techniques of multivariate statistics; with it we can analyze business, economics and social science data. There are existing techniques to detect outliers using regression analysis, and statisticians mainly follow two approaches. (i) After fitting classical least squares lines, they detect outliers in the Y direction by standardized residuals, studentized residuals (Srikantan, 1961) and deletion studentized residuals (Ellenberg, 1976), and outliers in the X direction by high leverage values. (ii) The robust techniques commonly used in the identification of multiple outliers are least median of squares (LMS) (Rousseeuw, 1984), least trimmed squares (LTS) (Rousseeuw, 1984) and reweighted least squares (RLS) (Rousseeuw and Leroy, 1987). In logistic regression, outliers are detected by the generalized standardized Pearson residual (GSPR) (Hadi and Simonoff, 1993; Atkinson, 1994; Munier, 1999; Imon, 2005).
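The robust-distance screening rule described above can be sketched numerically. This is a minimal illustration on synthetic data: the robust estimator used here (coordinatewise median with a diagonal MAD-based scatter) is a simple stand-in chosen for self-containment, not the estimator of Rousseeuw and Leroy.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# 100 well-behaved bivariate observations plus 5 planted outliers.
X = np.vstack([rng.normal(size=(100, 2)), np.full((5, 2), 8.0)])

# Crude robust (t_n, C_n): coordinatewise median and a diagonal scatter
# built from 1.4826 * MAD per coordinate (an illustrative stand-in; the
# paper does not prescribe a particular robust estimator).
t_n = np.median(X, axis=0)
sigma = 1.4826 * np.median(np.abs(X - t_n), axis=0)
C_n = np.diag(sigma ** 2)

# Robust distances RD_i = sqrt((x_i - t_n)^T C_n^{-1} (x_i - t_n)).
diff = X - t_n
rd = np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(C_n), diff))

# Flag observations beyond the sqrt(chi^2_{p, 0.975}) cutoff.
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
outliers = np.flatnonzero(rd > cutoff)
print(outliers)
```

The five planted points are flagged, along with at most a few borderline inliers, which is the expected behavior of the 0.975 cutoff.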

Singular Value Decomposition
The singular value decomposition (SVD) can be viewed as the extension of the eigenvalue decomposition to non-square matrices. It shows that any real matrix can be diagonalized using two orthogonal matrices. The eigenvalue decomposition, in contrast, works only on square matrices and uses only one matrix (and its inverse) to achieve diagonalization. If the matrix is square and symmetric, the two orthogonal matrices of the SVD become equal, and the eigenvalue decomposition and the SVD become one and the same thing. Because the SVD is much more general than the eigenvalue decomposition and is intimately related to the matrix rank and to reduced-rank least squares approximations, it is a very important and useful tool in matrix theory, statistics and signal analysis. It can also be used as a data reduction technique.

Singular value decomposition, especially its low rank approximation property, is an elegant part of modern matrix theory. After its inception in 1936 [10], its fascinating two-way data reduction capacity remained unnoticed until the last quarter of the last century. Since then statisticians have shown increasing interest in SVD for principal component analysis (PCA), canonical correlation analysis (CCA) and cluster analysis. Principal component analysis, often performed by singular value decomposition, is a popular analysis method that has recently been explored as a method for analyzing large-scale expression data (Raychaudhuri et al., 2000; Alter et al., 2000) [11,12]. Additionally, SVD/PCA has been used to identify high-amplitude modes of fluctuation in macromolecular dynamics simulations (Garcia, 1992; Romo et al., 1995) [13,14] and to identify structural intermediates in lysozyme folding using small-angle scattering experiments (Chen et al., 1996) [15]. One of the challenges of bioinformatics is to develop effective ways to analyze global gene expression data. A rigorous approach to gene expression analysis must involve an up-front characterization of the structure of the data. In addition to a broader utility in analysis methods, SVD and PCA can be valuable tools in obtaining such a characterization. SVD and PCA are common techniques for the analysis of multivariate data, and gene expression data are well suited to analysis using SVD/PCA. A single microarray experiment can generate measurements for thousands, or even tens of thousands, of genes. Gene expression data are currently rather noisy, and SVD can detect and extract small signals from noisy data. Since SVD can reduce data in both ways, by columns (generally variables) and by rows (generally cases), is more numerically stable, and moreover yields PCA as a by-product, in modern research it is used more and more frequently in place of classical PCA for data compression (Diamantaras and Kung, 1996) [16], clustering (Murtagh, 2002) [17] and multivariate outlier detection (Penny and Jolliffe, 2001) [18].

Low Rank Approximation of SVD
Low rank approximation (Eckart and Young, 1936) [10] is an important property of SVD. It has a wonderful data reduction capacity with minimum recovery error, and with it we can reduce variables as well as observations. If X is an m×n matrix of rank k ≤ min(m,n), then by singular value decomposition we can write

X = U Σ V^T,    (1)

where U is the column-orthonormal matrix whose columns are the eigenvectors of X X^T, Σ is the diagonal matrix containing the singular values of X, and V is the orthogonal matrix whose columns are the eigenvectors of X^T X.

From (1) we can write

X = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_k u_k v_k^T.

Suppose we approximate X by X̃, whose rank is r < k ≤ min(m,n):

X̃ = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + … + σ_r u_r v_r^T = U_r Σ_r V_r^T,

where U_r is m×r, Σ_r is a diagonal matrix of order r and V_r is n×r. Post-multiplying both sides by V_r, we have

X̃ V_r = U_r Σ_r;

its first column represents the first PC, its second column represents the second PC, and so on. Hence we see that X is an m×n matrix but X̃ V_r is m×r. Generally n represents the number of variables, so the method reduces the data by reducing the number of variables.
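The identities above can be checked numerically. A small sketch with NumPy, on an arbitrary synthetic matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 5, 2
X = rng.normal(size=(m, n))            # an m x n data matrix

# Full SVD: X = U diag(s) V^T, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-r approximation: sum of the first r terms s_j u_j v_j^T,
# i.e. U_r S_r V_r^T.
X_approx = (U[:, :r] * s[:r]) @ Vt[:r]

# Post-multiplying by V_r yields the m x r score matrix U_r S_r.
scores = X_approx @ Vt[:r].T
print(np.allclose(scores, U[:, :r] * s[:r]))   # True

# Eckart-Young: the recovery error equals the energy of the
# discarded singular values.
err = np.linalg.norm(X - X_approx)
print(np.isclose(err, np.sqrt(np.sum(s[r:] ** 2))))  # True
```

Note that reading the columns of X̃ V_r as principal component scores implicitly assumes the columns of X have been centered, a step the derivation above leaves implicit.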

SVD Based Outlier Detection Method
We develop a graphical method of outlier detection using SVD, suitable for both general multivariate data and regression data. We construct scatter plots of the first two PCs, and of the first and third PCs. In each scatter plot we also draw a box whose range is median(1st PC) ± 3 × MAD(1st PC) on the X-axis and median(2nd PC/3rd PC) ± 3 × MAD(2nd PC/3rd PC) on the Y-axis, where MAD is the median absolute deviation. Points outside the box can be considered extreme outliers, while points outside one side of the box are termed outliers. Along with this box we may construct another, smaller box bounded by the 2.5/2 MAD line.
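The box construction can be sketched in code. This is a minimal illustration on a synthetic data set with one planted outlier; the cutoff 3 and the unscaled MAD follow the text, while the data and the helper names are ours.

```python
import numpy as np

def mad(v):
    """Median absolute deviation (unscaled, as used in the text)."""
    return np.median(np.abs(v - np.median(v)))

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
X[7] = 10.0                       # plant one extreme observation

# Principal component scores via SVD of the column-centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc = U * s                        # columns: PC1, PC2, PC3, ...

# Box: median +/- 3 * MAD on PC1 (x-axis) and PC2 (y-axis); points
# outside the box on either axis are flagged.
def outside(v, k=3.0):
    return np.abs(v - np.median(v)) > k * mad(v)

extreme = np.flatnonzero(outside(pc[:, 0]) | outside(pc[:, 1]))
print(extreme)
```

In a plot one would draw the box itself and read off the flagged points visually; here the planted observation 7 falls far outside the PC1 limits, together with a few borderline points, since 3 MAD is roughly 2 standard deviations for normal data.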

Example
In this section we apply our method to some real data sets that are well known in multivariate analysis and regression analysis.
5.1 Car Data
Our first example is the low-dimensional car data set, available in S-PLUS as the data frame cu.dimensions. For n = 111 cars, p = 11 characteristics were measured, such as the length, the width and the height of the car. Applying our method we get Figure 1. From Figure 1(a) we see that observations 25, 30, 32, 34, 36, 102, 104, 107, 108, 110 and 111 are outside the box; from Figure 1(b) we see that observations 6, 102 and 104–111 are outside the box. So from the graphs we can say that observations 6, 25, 30, 32, 34, 36, 102 and 104–111 are unusual. Hubert, Rousseeuw and Branden (2005)[19] declared observations 25, 30, 32, 34, 36 and 102–111 as outliers using ROBPCA.
Figure 1. Scatter plot of car data (a) scatter plot of first two PCs and (b) scatter plot of first and third PC.

Hawkins-Bradu-Kass (1984) Data
Hawkins, Bradu and Kass (Hawkins et al., 1984)[20] constructed an artificial three-predictor data set containing 75 observations with 14 influential observations: ten high leverage outliers (cases 1–10) and four high leverage points (cases 11–14) (Imon, 2005)[21]. Applying our method to these data we get Figure 2, in which observations 1–14 lie outside our box, so observations 1–14 are unusual. We also see that three clusters are present in the data: in Figure 2(a), observations 1–10 form the first cluster, observations 11–14 the second, and the remaining observations the third. Our plot outperforms the index plot of standardized residuals obtained from LMS (Rousseeuw and Leroy, 1987)[5].
Figure 2. Scatter plot of Hawkins, Bradu and Kass data
(a) scatter plot of first two PCs and (b) scatter plot of first and third PC.

Modified Brown Data
We next consider the data set given by Brown (1980)[22]. The main objective was to see whether an elevated level of acid phosphatase (A.P.) in the blood serum would be of value for predicting whether or not prostate cancer patients also had lymph node involvement (L.N.I.). Ryan (1997)[23] pointed out that the original data on the 53 patients contain one outlier (observation 24). Imon and Hadi (2005)[9] modified this data set by adding two more outliers as cases 54 and 55, and showed that observations 24, 54 and 55 are outliers using the generalized standardized Pearson residual (GSPR) (Hadi and Simonoff, 1993; Atkinson, 1994; Munier, 1999; Imon, 2005)[6,7,8,9]. Applying our method we get Figure 3, from which observations 24, 54, 55, 53 and 25 are detected as outliers.
Figure 3. Scatter plot of modified Brown data (a) scatter plot of first two PCs and (b) scatter plot of first and third PC.

Advantages of Our Method
Our method has the following advantages over other competitive methods:
It is easy to understand without hard mathematics.
It can be applied to both regression data and other types of multivariate data.
It can be applied directly to separate extreme outliers from general outliers.
It can detect several clusters that other outlier detection methods fail to pinpoint.
It can single out not only outlying observations but also outlying variables.

Conclusion
From the above discussion we conclude that the proposed method is very helpful for multivariate outlier detection. The method also lets us see the structure of multivariate data graphically. Though there are several existing methods for detecting multivariate outliers, the SVD-based technique compares favorably with them.

References

[1] Rousseeuw P.J. and Leroy A. (1987). Robust Regression and Outlier Detection. New York: Wiley. doi:10.1002/0471725382
[2] Rousseeuw P.J. and Van Zomeren B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411), 633-651.
[3] Srikantan K.S. (1961). Testing for the single outlier in a regression model. Sankhya, Series A, 23, 251-260.
[4] Ellenberg J.H. (1976). Testing for a single outlier from a general regression. Biometrics, 32, 637-645. doi:10.2307/2529752
[5] Rousseeuw P.J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-880.
[6] Hadi A.S. and Simonoff J.S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272. doi:10.1080/01621459.1993.10476407
[7] Atkinson A.C. (1994). Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association, 89, 1329-1339. doi:10.1080/01621459.1994.10476872
[8] Munier S. (1999). Multiple outlier detection in logistic regression. Student, 3, 117-126.
[9] Imon A.H.M.R. and Hadi A.S. (2005). Identification of multiple outliers in logistic regression. International Statistics Conference on Statistics in the Technological Age, Institute of Mathematical Sciences, University of Malaya, Kuala Lumpur, Malaysia, December 2005. doi:10.1080/03610920701826161
[10] Eckart C. and Young G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218. doi:10.1007/BF02288367
[11] Raychaudhuri S., Stuart J.M. and Altman R.B. (2000). Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing 2000, 455-466.
[12] Alter O., Brown P.O. and Botstein D. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences USA, 97, 10101-10106. doi:10.1073/pnas.97.18.10101
[13] Garcia A.E. (1992). Large-amplitude nonlinear motions in proteins. Physical Review Letters, 68, 2696-2699. doi:10.1103/PhysRevLett.68.2696
[14] Romo T.D., Clarage J.B., Sorensen D.C. and Phillips G.N. Jr. (1995). Automatic identification of discrete substates in proteins: singular value decomposition analysis of time-averaged crystallographic refinements. Proteins, 22, 311-321. doi:10.1002/prot.340220403
[15] Chen L., Hodgson K.O. and Doniach S. (1996). A lysozyme folding intermediate revealed by solution X-ray scattering. Journal of Molecular Biology, 261, 658-671. doi:10.1006/jmbi.1996.0491
[16] Diamantaras K.I. and Kung S.Y. (1996). Principal Component Neural Networks: Theory and Applications. John Wiley & Sons, Inc., New York, 45-46.
[17] Murtagh F. (2002). Clustering in high-dimensional data spaces. In Classification, Clustering and Data Analysis, eds. Jajuga K., Sokolowski A. and Bock H., Springer-Verlag, Berlin, 89-96.
[18] Penny K.I. and Jolliffe I.T. (2001). A comparison of multivariate outlier detection methods for clinical laboratory safety data. Journal of the Royal Statistical Society: Series D (The Statistician), 50(3), 295-308. doi:10.1111/1467-9884.00279
[19] Hubert M., Rousseeuw P.J. and Branden K.V. (2005). ROBPCA: a new approach to robust principal component analysis. Technometrics, 47, 64-79. doi:10.1198/004017004000000563
[20] Hawkins D.M., Bradu D. and Kass G.V. (1984). Location of several outliers in multiple-regression data using elemental sets. Technometrics, 26, 197-208. doi:10.1080/00401706.1984.10487956
[21] Imon A.H.M.R. (2005). Identifying multiple influential observations in linear regression. Journal of Applied Statistics, 32, 73-90. doi:10.1080/02664760500163599
[22] Brown B.W. Jr. (1980). Prediction analysis for binary data. In Biostatistics Casebook, eds. Miller R.G. Jr., Efron B., Brown B.W. Jr. and Moses L.E., New York: Wiley.
[23] Ryan T.P. (1997). Modern Regression Methods. Wiley, New York.