 Open Access
 Authors : Alvina Anna John, S. Deepajothi
 Paper ID : IJERTV2IS4940
 Volume & Issue : Volume 02, Issue 04 (April 2013)
Published (First Online): 23-04-2013
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Privacy Preservation Of Data Sets In Data Mining
1 Alvina Anna John, 2 S. Deepajothi
1 PG Scholar, 2 Assistant Professor, Chettinad College of Engineering & Technology, Karur
Abstract
Data mining is the process of extracting or mining knowledge from large databases. Privacy preserving data mining plays an important role in the areas of data mining and security: data mining algorithms are analyzed for their impact on data privacy. The goal of privacy preserving data mining is to develop algorithms that modify the original data set so that the privacy of confidential information is preserved and no confidential information can be revealed as a result of applying data mining tasks. The existing privacy preservation technique is the data set complementation approach, which fails if all data sets are leaked, because the data set reconstruction algorithm is generic. The proposed method provides privacy preservation by converting the original sample data sets into a group of unreal data sets and then applying cryptographic privacy protection to sensitive values. The cryptographic technique implemented is RSA. This method provides privacy preservation with improved accuracy. This work covers the application of the new privacy preserving approach with the Naïve Bayesian classification algorithm.

Introduction
Data mining is a method that helps to extract useful information from large databases through the utilization of data mining algorithms. As the quantity of information doubles each year, data mining is becoming an increasingly important tool to transform this data into knowledge. Data mining deals with massive databases which can contain sensitive information, and it requires data preparation which can uncover information or patterns that may compromise confidentiality and privacy obligations. The advancement of efficient data mining techniques has increased the risk of revealing sensitive data. Providing security to sensitive information against unauthorized access has been a long-term objective for the database security research community and for government statistical agencies. Hence, the protection issue has become an important area of research in data mining. The release of personal data in its most specific state poses a threat to the privacy of an individual: the threat arises when anyone who has access to the newly compiled data set is able to identify specific individuals. The disclosure of extracted patterns also opens up the danger of privacy breaches that may reveal sensitive information to malicious users, thus causing privacy violations in data mining. Hence there is a need for privacy preservation in data mining. Privacy preservation in data mining (PPDM) is the process of providing security to sensitive data in a database against unauthorized access; it is implemented to protect sensitive information and to prevent violations of privacy.

Related Work
M. Kantarcioglu et al. discuss the concept of privacy violation in data mining [10]; the paper notes that privacy-preserving data mining has concentrated on obtaining valid results when the input data is private. Alexandre Evfimievski et al. refer to privacy-preserving data mining (PPDM) as the area of data mining that seeks to safeguard sensitive information from unsolicited or unsanctioned disclosure [2]. The term privacy preserving data mining was introduced in the papers of Agrawal et al. [3] and Lindell et al. [4]. These papers considered two fundamental problems of PPDM: privacy preserving data collection, and mining a data set partitioned across several private enterprises. Latanya Sweeney proposed k-anonymity as a model for protecting privacy [5]. That paper discusses k-anonymity protection and also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. Ashwin Machanavajjhala et al. proposed the technique l-diversity as privacy preservation beyond k-anonymity [6]. Their paper provides a detailed analysis of two attacks on k-anonymity, the homogeneity attack and the background knowledge attack, and provides a powerful privacy definition called l-diversity. Ninghui Li et al. give a detailed study of a privacy preservation technique called t-closeness, a notion of privacy beyond k-anonymity and l-diversity [7]. The paper proposes the novel privacy notion t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table. E. Poovammal et al. provide a detailed study of different privacy preservation techniques such as data publishing, k-anonymity, l-diversity, and a privacy preserving model [8]. That paper discusses the advantages and disadvantages of k-anonymity, the advantage of l-diversity over k-anonymity, and finally a privacy preserving model which performs privacy preservation based on the data type: numerical values are transformed by categorical membership values, and categorical values by mapping values.

Our Contribution
We present a privacy preserving scheme which applies cryptographic privacy to the unrealized data sets constructed from the original data sets. The cryptographic method implemented is the RSA approach. Accuracy evaluation is then performed with the help of the Naïve Bayesian classification algorithm. The phases involved are as follows.

Data Set Pre-Processing

Pre-processing is a very important step in the data mining process. Data pre-processing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and the time necessary for actual mining, as well as the quality of the data and the accuracy and efficiency of the mining process.
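As a concrete illustration of the cleaning step used later in the evaluation (records with missing values are deleted), the following minimal sketch drops incomplete records; the function name and the set of missing-value markers are assumptions for illustration, not part of the paper:

```python
def drop_incomplete(rows, missing=("", "?", None)):
    """Remove records that contain any missing value (simple cleaning sketch)."""
    return [r for r in rows if not any(v in missing for v in r)]
```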
Unrealized Data Set Construction
To unrealize the samples, we start with both the perturbing data set TP and the output set T' as empty sets. The universal data set TU is passed as a parameter of the function, because reusing a precomputed universal data set is more efficient than recalculating it, which saves the time of recalculation. The recursive function Unrealize-Training-Set takes one data set from the input sample data set in each recursion, without any special requirement; it then updates the perturbing data set and the set of output training data sets for the next recursion [1]. The original data set can be reconstructed if both of the unrealized data sets, i.e., both TP and T', leak out, so we apply cryptographic protection. The algorithm for unrealization of the data sets is as follows [1].
INPUT : TS, TU, TP, T'
OUTPUT : TP, T'

We invoke the algorithm with TP and T' as empty sets.

1. If TS is empty, return (TP, T').
2. Take a data set t in TS.
3. If t is not an element of TP, or TP = {t}, then TP = TP + TU.
4. TP = TP - {t}.
5. Let t' be the most frequent data set in TP.
6. Recurse on (TS - {t}, TU, T' + {t'}, TP - {t'}).
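The recursive unrealization procedure can be sketched in Python, treating TP and T' as multisets; this is a minimal sketch after the algorithm in [1], assuming TU contains more than one distinct tuple, and the function and variable names are our own:

```python
from collections import Counter

def unrealize_training_set(TS, TU, TP=None, T_prime=None):
    """Recursive unrealization sketch (after Fong and Weber-Jahnke [1]).

    TS: list of sample tuples; TU: the universal data set (all possible tuples).
    TP (perturbing set) and T_prime (output set T') are multisets (Counters).
    """
    TP = Counter() if TP is None else TP
    T_prime = Counter() if T_prime is None else T_prime
    if not TS:                             # base case: all samples consumed
        return TP, T_prime
    t = TS[0]                              # take a data set t in TS
    if TP[t] == 0 or set(+TP) == {t}:      # t not in TP, or TP = {t}
        TP.update(TU)                      # TP <- TP + TU
    TP[t] -= 1                             # TP <- TP - {t}
    t_dash = (+TP).most_common(1)[0][0]    # t' = most frequent data set in TP
    T_prime[t_dash] += 1                   # T' <- T' + {t'}
    TP[t_dash] -= 1                        # TP <- TP - {t'}
    return unrealize_training_set(TS[1:], TU, TP, T_prime)
```

For instance, with samples [1, 1] over the universal set [1, 2], every output tuple differs from the input tuples, which is the "unreal" property the scheme relies on.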
Cryptographic Privacy
Once the unrealized data sets are obtained, the cryptographic approach of RSA is applied to the attributes of the unrealized data sets for enhanced protection and to prevent the leakage of the unrealized data sets. The algorithm for cryptographic privacy is as follows [14].

INPUT : attribute Age (m) of TP + T'
OUTPUT : encrypted attribute Age (c)

1. Choose two distinct prime numbers p and q.
2. Compute n = pq.
3. Compute the totient of n, φ(n) = (p - 1)(q - 1).
4. Choose e such that 1 < e < φ(n) and e and φ(n) share no divisors other than 1; e is the public exponent used for encryption.
5. Determine d such that d·e ≡ 1 (mod φ(n)); d is the private exponent used for decryption.
6. Encryption: c ≡ m^e (mod n).
7. Decryption: m ≡ c^d (mod n).
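The steps above can be sketched in Python (3.8+, for the modular inverse via `pow`). This is textbook RSA with no padding, shown on small primes purely to illustrate the key setup and the two congruences; it is not secure for real use, and the function names are our own:

```python
from math import gcd

def rsa_keypair(p, q, e=None):
    """Textbook RSA key generation from two distinct primes (illustrative only)."""
    n = p * q
    phi = (p - 1) * (q - 1)                # totient: (p-1)(q-1)
    if e is None:                          # smallest odd e coprime with phi
        e = next(k for k in range(3, phi, 2) if gcd(k, phi) == 1)
    d = pow(e, -1, phi)                    # d such that d*e ≡ 1 (mod phi)
    return (e, n), (d, n)                  # public key, private key

def rsa_encrypt(m, pub):
    e, n = pub
    return pow(m, e, n)                    # c ≡ m^e (mod n)

def rsa_decrypt(c, priv):
    d, n = priv
    return pow(c, d, n)                    # m ≡ c^d (mod n)
```

For example, with p = 61 and q = 53 (so n = 3233), an age value of 42 encrypts to a ciphertext that decrypts back to 42 with the private key.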
Evaluation Using NaÃ¯ve Bayesian
Naïve Bayesian [13] is a statistical classifier which performs probabilistic prediction, i.e., it predicts class membership probabilities. It is based on Bayes' theorem: P(H|X) = [P(X|H)P(H)] / P(X).
Let D denote a set of tuples with associated class labels. A tuple is denoted X, and there exist m classes C1, C2, ..., Cm. Applying Bayes' theorem, we get P(Ci|X) = [P(X|Ci)P(Ci)] / P(X).
The class with the largest value of P(X|Ci)P(Ci) is assigned as the class value of the tuple. Accuracy is calculated as the percentage of test tuples that are correctly classified.
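A minimal Naïve Bayesian classifier for categorical attributes can be sketched as follows; this is an illustrative implementation of the decision rule above (with Laplace smoothing added to avoid zero probabilities, an assumption not stated in the paper), not the authors' exact evaluation code:

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Estimate class priors P(Ci) and categorical likelihoods P(xj | Ci)."""
    prior = Counter(labels)
    counts = defaultdict(Counter)          # (class, attr index) -> value counts
    values = defaultdict(set)              # attr index -> observed value set
    for x, c in zip(rows, labels):
        for j, v in enumerate(x):
            counts[(c, j)][v] += 1
            values[j].add(v)
    return prior, counts, values, len(labels)

def nb_predict(x, model):
    """Assign the class maximizing P(X|Ci)P(Ci), with Laplace smoothing."""
    prior, counts, values, n = model
    best, best_score = None, -1.0
    for c, nc in prior.items():
        score = nc / n                     # P(Ci)
        for j, v in enumerate(x):          # naive independence assumption
            score *= (counts[(c, j)][v] + 1) / (nc + len(values[j]))
        if score > best_score:
            best, best_score = c, score
    return best
```

Accuracy is then the fraction of test tuples for which `nb_predict` returns the true class label.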
Fig. 1. System Architecture

Results And Discussions
The data set taken for evaluation is a heart data set with 574 records and 7 attributes. Records with missing values are deleted as the pre-processing step; then the unrealization of the data set is performed, followed by the application of RSA over the attribute Age for cryptographic protection, in order to prevent the leakage of the original data set from the unrealized data sets. The Naïve Bayesian classification algorithm is then applied to evaluate accuracy over the original data set, the unrealized data sets, and the cryptographic data set. The cryptographically privacy-protected data set gives better accuracy than the existing system.
Fig. 2. Accuracy Comparison

Conclusion And Future Works
In this paper, privacy preservation of data sets is efficiently provided by combining the unrealization of data sets with a cryptographic approach. This work presents a new privacy preserving approach which removes each sample via a perturbing data set, followed by cryptographic protection with RSA. In the future, in order to improve both privacy preservation and mining efficiency, an effective privacy preserving distributed mining algorithm for association rules can be developed.

References

[1] P. K. Fong and J. H. Weber-Jahnke, "Privacy Preserving Decision Tree Learning Using Unrealized Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 2, 2012.

[2] A. Evfimievski and T. Grandison, "Privacy Preserving Data Mining," IBM Almaden Research Center, USA.

[3] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proc. ACM SIGMOD Conf. on Management of Data, Dallas, TX, May 14-19, 2000, pp. 439-450.

[4] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology (CRYPTO 2000), Springer-Verlag, 2000, pp. 36-54.

[5] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, vol. 10, no. 5, pp. 571-588, 2002.

[6] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," in Proc. 22nd Intl. Conf. on Data Engineering (ICDE), 2006, p. 24.

[7] N. Li, T. Li, and S. Venkatasubramanian, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity."

[8] E. Poovammal and M. Ponnavaikko, "Task Independent Privacy Preserving Data Mining on Medical Dataset," in Proc. Intl. Conf. on Advances in Computing, Control, and Telecommunication Technologies, 2009.

[9] I. Dinur and K. Nissim, "Revealing information while preserving privacy," in Proc. PODS, 2003, pp. 202-210.

[10] M. Kantarcioglu, J. Jin, and C. Clifton, "When Do Data Mining Results Violate Privacy?" in Proc. Intl. Conf. on Knowledge Discovery and Data Mining, 2004, pp. 599-604.

[11] D. Agrawal and C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," in Proc. ACM PODS Conference, 2002.

[12] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, "Hippocratic databases," in Proc. 28th Intl. Conf. on Very Large Databases, Hong Kong, China, 2002.

[13] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., 2006, ISBN 1-55860-901-6.

[14] http://en.wikipedia.org/wiki/RSA_(algorithm)