A Survey on Privacy Preserving Data Mining Techniques

Cina Mathew

Assistant Professor

Kristu Jyothi College of Management and Technology

Abstract:- Growing privacy concerns have become a major obstacle to storing and sharing data. Sharing data can be useful, but it must be done in a way that preserves users' privacy. This is not straightforward, because the shared data need to be protected against several privacy threats. Various algorithms have been designed for privacy-preserving data mining; they can be classified into three categories: privacy by policy, privacy by statistics, and privacy by cryptography. We review algorithms such as randomization, k-anonymization, and distributed privacy-preserving data mining, derive insights into their operation, and compare their advantages and disadvantages. We also study the computational and theoretical limits of privacy preservation over high-dimensional data sets.

Keywords: PPDM, Anonymization, Perturbation, Cryptography


  1. INTRODUCTION

    Recent years have seen unprecedented growth in the use of computer science in day-to-day activities. Organizations, communities and individuals increasingly store their data in the cloud. The huge amount of data collected can be used to analyse trends in markets, individuals or society. Data mining extracts knowledge from this massive pool of data. In the process, sensitive information about individuals may be disclosed, raising ethical and privacy issues. Many individuals therefore do not share their data publicly, which makes the data unavailable for analysis. The privacy of individuals should not be compromised under any circumstances. PPDM has gained popularity as a way to address these privacy concerns while data mining is carried out [1].

  2. PRIVACY PRESERVING DATA MINING [PPDM]

    Privacy preserving data mining is an area of data mining that is used to protect sensitive information from unsolicited or unsanctioned disclosure. It consists of data mining techniques and methodologies that satisfy privacy constraints while maintaining the usefulness of the data for mining. Privacy preserving data mining depends on a definition of privacy that describes the different attributes of the data: it specifies which attributes are sensitive and are therefore subject to a confidentiality constraint [2, 3]. The block diagram of PPDM is shown in Figure 1.

    Figure 1: Block diagram of PPDM



    In this section we focus on a number of methods that have recently been proposed for privacy preserving data mining. Several privacy preserving data mining technologies are surveyed in [5], where their pros and cons are analysed. In this paper, we give an overview of the state of the art in privacy preserving data mining. To preserve privacy, most methods apply some form of transformation to the data. Typically, such methods reduce the granularity of representation in order to protect privacy. This reduction in granularity results in some loss of effectiveness of the data and of the mining algorithms. This is the natural trade-off between information loss and privacy. Methods such as k-anonymity, l-diversity and t-closeness, together with privacy-preserving forms of classification and association rule mining, are all designed to prevent identification of individuals and so preserve the privacy of sensitive information. The application of several privacy-preserving techniques to an experimental data set is illustrated in [6], together with their effects on the results.

    1. Anonymization Algorithms

      Anonymization methods have emerged as an effective means to achieve privacy preservation. In these methods some part of the original data is transformed, for instance through generalization or compression, so that the transformed data cannot be combined with other information to infer any personal private information. The implementation of privacy preservation mainly concentrates on two aspects: (1) how to ensure that the data are used without privacy disclosure, and (2) how to make better use of the data. The pressing problem is therefore to find a trade-off between privacy preservation and data utilization.
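As an illustration of anonymization by generalization, the following minimal sketch coarsens two quasi-identifiers and then checks whether the result is k-anonymous. The records, the attribute names (`age`, `zip`, `disease`) and the helper functions are hypothetical and chosen only for this example; they are not taken from any of the surveyed papers.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters of a ZIP code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical records: age and zip are quasi-identifiers, disease is sensitive.
records = [
    {"age": 31, "zip": "68001", "disease": "flu"},
    {"age": 35, "zip": "68002", "disease": "asthma"},
    {"age": 38, "zip": "68009", "disease": "flu"},
    {"age": 52, "zip": "68110", "disease": "diabetes"},
    {"age": 57, "zip": "68115", "disease": "flu"},
]

# Generalize the quasi-identifiers; the sensitive attribute is left unchanged.
generalized = [
    {"age": generalize_age(r["age"]), "zip": generalize_zip(r["zip"]), "disease": r["disease"]}
    for r in records
]

print(is_k_anonymous(records, ["age", "zip"], k=2))      # False
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True
```

The generalized release satisfies 2-anonymity but no longer carries exact ages or ZIP codes, which is the loss of data utility that the trade-off above refers to.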

    2. Perturbation Techniques

      Data perturbation introduces random perturbation to individual values to preserve privacy before data are published. These techniques are statistically based methods that seek to protect confidential data by adding random noise to confidential numerical attributes, thereby protecting the original data. Data perturbation techniques are not encryption techniques, in which the data are first modified, then (typically) transmitted, and then, on receipt, decrypted back to the original data. Instead, the intent of these techniques is to give authorized users access to important aggregate statistics (such as means and correlations) from the entire database while protecting the individual identity of each record.
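The idea of additive random noise can be sketched as follows. This is a minimal illustration that assumes independent zero-mean Gaussian noise; the salary data, the noise level and the function name `perturb` are hypothetical and are not drawn from a specific published scheme.

```python
import random
import statistics

def perturb(values, noise_std, seed=0):
    """Return a copy of `values` with independent zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in values]

# Hypothetical confidential numerical attribute (for example, salaries).
data_rng = random.Random(42)
salaries = [data_rng.uniform(30_000, 90_000) for _ in range(10_000)]

# Only the perturbed values would be published.
published = perturb(salaries, noise_std=5_000)

# The aggregate (mean) stays close to the original, because the noise averages out.
print(round(statistics.mean(salaries)), round(statistics.mean(published)))

# An individual published value differs from the true value.
print(abs(salaries[0] - published[0]))
```

Because the noise has zero mean, its effect on the mean shrinks as the number of records grows, while each individual value stays masked. A larger `noise_std` masks individual values more strongly at the cost of less accurate statistics.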

    3. Distributed Privacy Preservation

      In many cases, individual entities may wish to derive aggregate results from data sets that are partitioned across these entities. For this purpose, privacy preserving distributed data mining is used; it aims to design secure protocols that allow multiple parties to conduct collaborative data mining while protecting the privacy of their data. Such partitioning may be horizontal (when the records are distributed across multiple entities) or vertical (when the attributes are distributed across multiple entities). In this setting the individual entities may not wish to share their entire data sets, but may consent to limited information sharing through a variety of protocols. The overall effect of such methods is to preserve privacy for each individual entity while deriving aggregate results over the entire data.
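One simple building block for such protocols is a ring-based secure sum, sketched below for horizontally partitioned data. The sketch simulates all parties in a single process and assumes that the parties follow the protocol honestly and do not collude; the function name `secure_sum`, the modulus and the hospital counts are assumptions made for illustration.

```python
import random

def secure_sum(local_values, modulus=2**32, rng=None):
    """Ring-based secure sum of non-negative integers whose total is below `modulus`.

    Party 0 masks its value with a random number, each following party adds its
    own value to the running total, and party 0 finally removes the mask.
    """
    rng = rng or random.Random()
    mask = rng.randrange(modulus)
    running = (mask + local_values[0]) % modulus   # message sent from party 0 to party 1
    for value in local_values[1:]:
        running = (running + value) % modulus      # each party sees only a masked total
    return (running - mask) % modulus              # party 0 unmasks the final total

# Hypothetical horizontally partitioned data: each hospital holds its own records
# and contributes only its local count of records matching some pattern.
local_counts = [120, 75, 310]
print(secure_sum(local_counts))  # 505
```

Each party learns the global count but never sees another party's local count directly, which matches the aim of deriving aggregate results without sharing entire data sets.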

    The advantages and limitations of some of the PPDM techniques are tabulated in Table 1.




    | Technique | Advantages | Limitations |
    |---|---|---|
    | Anonymization Technique | Secrecy of data is preserved. | More information loss. |
    | Perturbation Technique | Preserves various attributes independently. | Original data values cannot be regained. |
    | Distributed Data Mining | It is an efficient technique. Simple and supports large databases. | Minimal information loss. |
    | Cryptography Technique | Data encryption and decryption using keys is accurate and improves security. | Complexity and number of keys are proportional. |

    Table 1: Advantages and limitations of PPDM techniques


    Table 2 lists the PPDM methods reviewed in this survey for securing a data set while it is transferred or exchanged. For each method it gives the technique used, the approach taken and the reported result.

    | S. No | Authors | Year of Publication | Technique Used for PPDM | Approach | Result and Accuracy |
    |---|---|---|---|---|---|
    | 1 | Y. Lindell, B. Pinkas [6] | 2000 | Cryptographic Technique | Sensitive data are encrypted in different levels using keys. | The complexity increases when more than a few keys are involved. Also, it does not hold good for large databases. |
    | 2 | L. Sweeney [7] | 2002 | k-Anonymity | Information about an individual contained in a release cannot be distinguished from that of at least k-1 other individuals. | Privacy is preserved at greater levels. |
    | 3 | J. Vaidya and C. Clifton [8] | 2002 | Association Rule | Data are vertically distributed into segments. | Ensures privacy. |
    | 4 | Hillol Kargupta, Souptik Datta, Qi Wang and Krishnamoorthy Sivakumar [9] | 2003 | Data Perturbation | Data privacy is preserved by adding random noise. | Randomization techniques are used to generate random matrices. |
    | 5 | Charu C. Aggarwal, Philip S. Yu [10] | 2004 | Condensation Approach | Condenses the data into multiple groups of predefined size. The different records are not distinguishable. | Because the pseudo-data have the original format, the data mining algorithms no longer need to be redesigned. |
    | 6 | Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira [12] | 2010 | | Anonymization uses generalization and suppression for data hiding. | k-anonymity remains open to background knowledge and homogeneity attacks, so the sensitive information of an individual is not fully preserved. |
    | 7 | P. Deivanai, J. Jesu Vedha Nayahi and V. Kavitha [13] | | Hybrid Approach | A combination of different techniques that together give an integrated result. | It uses anonymization and suppression to preserve data. |
    | 8 | George Mathew, Zoran Obradovic [14] | 2011 | Decision Tree | An approach which is technical and methodological and should give judgemental knowledge. | A graph-based framework for preserving patients' sensitive information. |
    | 9 | M. N. Kumbhar and R. Kharat [16] | 2012 | Association Rule by Horizontal and Vertical Distribution | Different approaches in the field of association rule mining are reviewed. | The performance of all models is analyzed in terms of privacy, security and communications. |
    | 10 | Savita Lohiya and Lata Ragha [17] | 2012 | Hybrid Approach | A combination of k-anonymity and randomization. | It has more accuracy and original data can be regained. |
    | 11 | George Mathew, Zoran Obradovic [19] | 2012 | Distributed Privacy Preserving | Provides an algorithm to collaboratively build a better decision-making model. | It improves the overall accuracy of a classification. |
    | 12 | Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, Ankita | 2014 | Cryptography, Anonymization, Perturbation | A tabular comparison of work done by different methods. | Cryptography and Random Data Perturbation methods perform better than the other existing methods. |
    | 13 | M. Antony Sheela, K. Vijayalakshmi [24] | | Partition Based Perturbation | Techniques are applied to vertically partitioned data. | When the threshold value is reached, the individual data is changed. |
    | 14 | Jalpesh Vasa, Panthini Modi [25] | | | Anonymization-based techniques are used to preserve privacy by reducing the granularity. | Anonymization was found to be imperfect, so differential privacy was adopted instead. |
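The homogeneity weakness of k-anonymity noted in Table 2 can be made concrete with a small check. The sketch below counts the distinct sensitive values in each group of records that share the same quasi-identifier values (a simple "distinct l-diversity" measure); the release, the attribute names and the function name are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    group of rows that share the same quasi-identifier values.

    Assumes `rows` is non-empty.
    """
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical 2-anonymous release: every quasi-identifier group has two rows.
release = [
    {"age": "30-39", "zip": "680**", "disease": "flu"},
    {"age": "30-39", "zip": "680**", "disease": "flu"},
    {"age": "50-59", "zip": "681**", "disease": "diabetes"},
    {"age": "50-59", "zip": "681**", "disease": "asthma"},
]

print(distinct_l_diversity(release, ["age", "zip"], "disease"))  # 1
```

A result of 1 means that at least one group is homogeneous: anyone known to fall in the first group is revealed to have flu, even though the release is 2-anonymous.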


    Protecting sensitive data is a major privacy concern in today's world. People are anxious about sensitive information that they do not want to share. In this paper our survey focuses on the existing literature in the field of Privacy Preserving Data Mining. The primary objective of PPDM is to develop algorithms that hide sensitive data or otherwise offer privacy in data mining. From our analysis, we have found that no single PPDM technique outperforms every other technique on each possible criterion, such as use of data, performance, difficulty, and compatibility with data mining procedures. All methods perform differently depending on the type of data as well as the type of application or domain. Nevertheless, from our analysis we conclude that Distributed Data Mining and Random Data Perturbation methods perform better than the other existing methods.


  1. Alpa Shah and Ravi Gulati, "Privacy Preserving Data Mining: Techniques, Classification and Implications – A Survey", International Journal of Computer Applications, Vol. 137, No. 12, March 2016, pp. 40-46.

  2. AlShwaier and A. Z. Emam, "Data Privacy on E-Health Care System", International Journal of Engineering, Business and Enterprise Applications, 2013.

  3. Yang Xu, Tinghuai Ma, Meili Tang and Wei Tian, "A survey of privacy preserving data publishing using generalization and suppression", Appl. Math 8, no. 3, pp. 1103-1116, 2014.

  4. Y. Li, B. Vinzamuri, C. K. Reddy, "Constrained elastic net based knowledge transfer for health care information exchange", Data Mining Knowl. Discov. 29 (4), 2015, pp. 1094-1112.

  5. Jian Wang, Yongcheng Luo, Yan Zhao, Jiajin Le, "A Survey on Privacy Preserving Data Mining", First International Workshop on Database, 2009.

  6. O. Grljevic, Z. Bosnjak, R. Mekovec, "Privacy preserving in data mining – Experimental research on SMEs data", IEEE 9th International Symposium on Intelligent Systems and Informatics (SISY), 2011, pp. 477-481.


Y. Lindell, B. Pinkas, "Privacy preserving data mining", in proceedings of Journal of Cryptology, 5(3), 2000.

L. Sweeney, "k-Anonymity: A Model for Protecting Privacy", in proceedings of Int'l Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.

J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data", in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, CA, July 2002, IEEE 2002.

H. Kargupta, S. Datta, Q. Wang and K. Sivakumar, "On the Privacy Preserving Properties of Random Data Perturbation Techniques", in proceedings of the Third IEEE International Conference on Data Mining, IEEE 2003.

C. Aggarwal, P. S. Yu, "A condensation approach to privacy preserving data mining", in proceedings of International Conference on Extending Database Technology (EDBT), pp. 183-199, 2004.

A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity", Proc. Int'l Conf. Data Eng. (ICDE), p. 24, 2006.

Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, "Efficient Multi-Dimensional Suppression for K-Anonymity", in proceedings of IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 3 (March 2010), pp. 334-347, IEEE

P. Deivanai, J. Jesu Vedha Nayahi and V. Kavitha, "A Hybrid Data Anonymization integrated with Suppression for Preserving Privacy in mining multi party data", in proceedings of International Conference on Recent Trends in Information Technology, IEEE

G. Mathew, Z. Obradovic, "A Privacy-Preserving Framework for Distributed Clinical Decision Support", in proceedings of 978-1-61284852-5/11/$26.00 ©2011 IEEE.

A. Parmar, U. P. Rao, D. R. Patel, "Blocking based approach for classification Rule hiding to Preserve the Privacy in Database", in proceedings of International Symposium on Computer Science and Society, IEEE 2011.

M. N. Kumbhar and R. Kharat, "Privacy Preserving Mining of Association Rules on Horizontally and Vertically Partitioned Data: A Review Paper", in proceedings of 978-1-46735116-4/12/$31.00 ©, IEEE 2012.

S. Lohiya and L. Ragha, "Privacy Preserving in Data Mining Using Hybrid Approach", in proceedings of 2012 Fourth International Conference on Computational Intelligence and Communication Networks, IEEE 2012.

Martin Beck and Michael Marhöfer, "Privacy-Preserving Data Mining Demonstrator", in proceedings of 16th International Conference on Intelligence in Next Generation Networks, IEEE

George Mathew, Zoran Obradovic, "Distributed Privacy Preserving Decision System for Predicting Hospitalization Risk in Hospitals with Insufficient Data", in proceedings of 2012 11th International Conference on Machine Learning and Applications

Yuan Zhang, Sheng Zhong, "A privacy-preserving algorithm for distributed training of neural network ensembles", Neural Comput & Applic (2013) 22

Shweta Taneja, Shashank Khanna, Sugandha Tilwalia, Ankita, "A Review on Privacy Preserving Data Mining: Techniques and Research Challenges", International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, 2310-2315

Abel N. Kho, John P. Cashy, Karthryn L. Jackson, Adam R. Pah, Satyender Goel, Jorn Boehnke, John Eric Humphries, Scott Duke Kominers, Bala N. Hota, Shanon A. Sims, Bradley A. Malin, Dustin D. French, Theresa L. Walunas, David O. Meltzer, Erin O. Kaleba, Roderick C. Jones, Wiliam L. Galanter, "Design and implementation of a privacy preserving in electronic health record linkage tool in chicago", Journal of American Medical Informatics Association, (2015) 22(5)

V. Baby, N. Subhash Chandra, "Privacy-Preserving Distributed Data Mining Techniques: A Survey", International Journal of Computer Applications (0975 8887), Volume 143, No. 10, June

M. Antony Sheela, K. Vijayalakshmi, "Partition Based Perturbation for Privacy Preserving Distributed Data Mining", CYBERNETICS

Review of different privacy preserving techniques of PPDM
