Download Full-Text PDF Cite this Publication

**Open Access**-
**Total Downloads**: 7 -
**Authors :**Santhosh Kumar K -
**Paper ID :**IJERTCONV2IS13135 -
**Volume & Issue :**NCRTS – 2014 (Volume 2 – Issue 13) -
**Published (First Online):**30-07-2018 -
**ISSN (Online) :**2278-0181 -
**Publisher Name :**IJERT -
**License:**This work is licensed under a Creative Commons Attribution 4.0 International License

#### Data Privacy by Top Down Specialization using MapReduce Framework

Santhosh Kumar K Dept. of Computer Science

HKBK College of Engineering Bangalore, India.

AbstractFor the current research, data analysis and data mining we require the private data to be shared which brings privacy concerns. The privacy preservation of users data is very important since some of the countries have passed the privacy laws which tells that the sensitive information should not be exposed. Even after removing explicit identifying information such as Name and SSN, it is still possible to link released records back to their identities by matching some combination of nonidentifying attributes such as Sex, Zip, Birthdate. A useful approach to prevent such linking attacks, called k-anonymization is used to preserve the privacy by generalization. At present, the trend is Big Data, normal algorithms cannot handle the very large data. Scalability problems arise since current anonymization algorithms cannot handle the large datasets. In this paper we propose top down specialization algorithm which is done in two phase for data anonymization using mapreduce framework. Our algorithm will effectively perform anonymization and handles the scalability.

KeywordsK-Anonymity; Data Anonymization; top down specialization; Data Privacy; Cloud

INTRODUCTION

Cloud computing is an evolving paradigm withtremendousmomentum, but its unique aspects has concerns about security and privacy challenges. Cloud computing has generated significantinterest in both academia and industry, butits still an evolving paradigm. Cloud computing provides massive computation power and storage capacity by combining the commodity computers and it can be accessed through internet. Cloud computing reduces the investment for IT infrastructure and it can be used based on pay as you go basis. Even though cloud computing provides all the facilities; the customers are hesitant to use cloud services due to privacy and security concerns [1].

Privacy is one of the most concerned issues in cloud computing. Electronic health records and financial transaction records usually contains sensitive information but these data can offer significant human benefits if they are analyzed and mined by various research centers. For instance, Microsoft HealthVault, an online cloud health service, aggregates data from users and shares data with various research organizations. The data can be easily exposed by using traditional privacy protection on cloud. This can bring considerable economic loss or severe social reputation impairment to data owners. Hence, data privacy issues need to be addressed urgently before data sets are analyzed or shared on cloud.

Data anonymization has been extensively studied and widely adopted for data privacy and preservation in non

interactive data publishing and sharing scenarios [11]. Data anonymization refers to hiding identity and/or sensitive data of owners data records. Then, the privacy of an individual can be effectively preserved while certain aggregate information is exposed to data users for various analysis and mining. A variety of anonymization algorithms have been proposed [12], [13], [14], [15]. However, the scale of datasets that need anonymization in some cloud increases tremendously in accordance with the cloud computing and Big Data trend [1], [16]. Since the data sets size is very large it is difficult for the traditional anonymization algorithms to handle large datasets. The researchers have started investigation on the scalability problem of the large scale data anonymization [17], [18].

Large scale data processing framework like MapReduce

[19] has been integrated with cloud to provide higher and powerful computation capability for applications. So it will be useful for us to use such framework for addressing scalability problem in large scale data anonymization. So we use MapReduce to address the scalability problem of Top Down Specialization (TDS) approach [12] for large scale data anonymization. TDS is one of the widely used approach for data anonymization [12], [20], [21], [22]. TDS algorithms are centralized so it cannot handle large data sets. There are few distributed algorithms proposed [20], [22] but they handle the anonymization of third party data sets and they do not focus on scalability. Even though MapReduce is simple to implement, it is difficult to fit TDS in MapReduce framework.In this paper, we propose a scalable two phase TDS approach for data anonymization using MapReduce on cloud. Here we use highly optimized and highly efficient ARX anonymization tool libraries for k anonymity and Top Down Specialization [34]. In this we split the anonymization process into 2 phases. In the first phase, original data sets are partitioned into group of smaller data sets, and these data sets are anonymized in parallel, to get intermediate results. In the second phase, the intermediate results are integrated into one, and further anonymized to get consistent k-anonymous [23] data sets. We use MapReduce in both the phases. Experimental results show that in our approach efficiency and scalability of TDS is improved over existing approaches.

RELATED WORK

A.Related Work

Recently, data privacy preservation has beenextensivelyinvestigated [11]. We briefly review related

work below.LeFevre et al. [17] addressed the scalability problem ofanonymization algorithms via introducing scalable decisiontrees and sampling techniques. Iwuchukwu and Naughton[18] proposed an R-tree index-based approach by building aspatialindex over data sets, achieving high efficiency.However, the above approaches aim at

()and()can be computed via statistical informationderived from data sets.Let denote the set of originalrecords containing attribute values that can be generalizedto. | |is the number of data records in . Let

( )bethe entropy of . Then, () is calculated by

multidimensionalgeneralization [15], thereby failing to work in

= )

(2)

the TDSapproach. Fung et al. [12], [20], [21] proposed the

( )

| |

TDSapproach that produces anonymous data sets without thedata exploration problem [11]. A data structure TaxonomyIndexed PartitionS (TIPS) is exploited to improve theefficiency of TDS. But the approach is centralized,

Let | , |denote the number of the data records withsensitive valuein . ( )is computed by

= |( , )| . 2 |( , )| (3)

leadingto its inadequacy in handling large-scale data sets.

|( |

|( |

Several distributed algorithms are proposed to preserveprivacy of multiple data sets retained by multiple parties.Jiang and Clifton [24] and Mohammed [22] proposeddistributed algorithms to anonymize vertically partitioneddata from different data sources without disclosing privacyinformation from one party to another. Jurczyk and Xiong[25] and Mohammed et al. [20] proposed distributedalgorithms to anonymize horizontally partitioned data setsretained by multiple holders. However, the above distributedalgorithms mainly aim at securely integrating andanonymizing multiple data sources. Our research mainlyfocuses on the scalability issue of TDS anonymization, andis, therefore, orthogonal and complementary to them.

As to MapReduce-relevant privacy protection, Roy et al.[26] investigated the data privacy problem caused byMapReduce and presented a system named Airavat incorporatingmandatory access control with differentialprivacy. Further, Zhang [27] leveraged MapReduceto automaticallypartition a computing job in terms of datasecurity levels, protecting data privacy in hybrid cloud. Ourresearch exploits MapRedue itself to anonymize large- scaledata sets before data are further processed by otherMapReduce jobs, arriving at privacy preservation.

PRELIMINERY

A.Top-Down Speciaization

The anonymity of a data set is defined by the minimumgroup size out of all QI-groups, denoted as A, i.e., A= minqid QID {|QIG(qid)|} where |QIG(qid)| is the size of QIG(qid). Let Ap(spec) be that after performingspec. Privacy loss by spec is calculated by

= (4)

TWO PHASE TOP DOWN SPECIALIZATION (TPTDS)

A. Sketch of Two Phase Top Down Specialization

We propose a TPTDS approach to conduct the computationrequired in TDS in a highly scalable and efficient fashion.The two phases of our approach are based on the two levelsof parallelization provisioned by MapReduce on cloud. Combined with cloud, MapReduce becomesmore powerful and elastic as cloud can offer infrastructureresources on demand, for example, Amazon ElasticMapReduce service [29]. To achievehigh scalability, we are parallelizing multiple jobs on datapartitions in the first phase, but the resultant anonymizationlevels are not identical. To obtain finally consistentanonymous data sets, the second phase is necessary tointegrate the intermediate results and further anonymized entire data sets. Details are formulated as follows.

In the first phase, an original data setis partitionedinto smaller ones.Let , 1 , denote the data setspartitioned from, where is the number of partitions, and =

Generally, TDS is an iterative process starting from

=1

, , = , 1 < .

i

thetopmost domain values in the taxonomy trees of attributes.Each round of iteration consists of three main steps, namely,finding the best specialization, performing specializationand updating values of the search metric for the next round[12]. Such a process is repeated until k-anonymity isviolated, to expose the maximum data utility. The goodnessof a specialization is measured by a search metric. We adoptthe information gain per privacy loss (IGPL), a tradeoffmetric that considers both the privacy and informationrequirements, as the search metric in our approach. Aspecialization with the highest IGPL value is regarded asthe best one and selected in each round. We briefly describehow to calculate the value of IGPL subsequently to makereaders understand our approach well.

Given a specialization: (), the IGPL ofthe

Then, we run a subroutine over each of the partitioneddata sets in parallel to make full use of the job levelparallelization of MapReduce. The subroutine is a MapReduceversion of centralized TDS (MRTDS) which concretelyconducts the computation required in TPTDS. MRTDSanonymizes data partitions to generate intermediate anonymizationlevels. An intermediate anonymization levelmeans that further specialization can be performed withoutviolating k-anonymity. MRTDS only leverages the task levelparallelization of MapReduce.Formally, let functionMRTDS(D, k, AL) AL1 represent a MRTDS routine thatanonymizes data setDto satisfy k-anonymity from anonymizationlevelALto AL1. Thus, a series of functionsMRTDS Di, kI , AL0 AL1, 1 i p, are executed simultaneouslyin the first phase. The termkI denotes

specialization is calculated by

theintermediate anonymity parameter,

I

usually given by

I

applicationdomain experts. Note thatk should satisfy k

= () ( + 1)

(1)

ktoensure privacy preservation.AL0is the initial

i

The term()is the information gain after performing, ()is the privacy loss.

anonymizationlevel, i.e.,AL0 = Top1 , Top2 , , {Topm } , where Topj Domj , 1 j m, is the topmost domain value inTTi. AL1is the resultant intermediate anonymization level.

In the second phase, all intermediate anonymizationlevels are merged into one. The merged anonymizationlevel is denoted asALI . The merging process is formallyrepresented as

function( , 1 , , 1 ) . Then,the whole

Output: , 1

Map: Generate a random number , where 1 ; emit (, ).

1 2

data setD is further anonymized based on , achieving k- Reduce: For each rand, emit (, ()).

anonymity finally, i.e.,(, , ) , where

denotes the final anonymization level. Ultimately, is

Once partitioned data sets , 1 , are obtained,

concretely anonymized according to. Above all, Algorithm

werun( , , 0)

1 depicts thesketch of the two-phase TDS approach.

on these data sets in parallel

toderive intermediate anonymization levels, 1 .

Algorithm 1.SKETCH OF TWO-PHASE TDS (TPTDS).

Input: Data set , anonymity parameters , and the number of partitions .

C. Anonymization Level Merging

All intermediate anonymization levels are merged into onein the second phase. The merging of anonymization levels

iscompleted by merging cuts. Specifically, let in and

'

Output: Anonymous data set

1. Partition into , 1 .

2. Execute , , 0 , 1 in parallel as multiple MapReduce jobs.

3. Merge all intermediate anonymization levels into one,

1

2

( , , , ) .

4. Execute (, , ) to achieve k- anonymity.

in be two cuts of an attribute. There existdomain

values and . that satisfy one of the three conditions is identical to , is more generalthan , or is more specific than . To ensure that themerged intermediate anonymization level neverviolates privacy requirements, the more general one isselected as the merged one, for example, will be selected. will be selected if is more general than or identical to . For the case ofmultiple anonymization levels, we can merge them in thesame way iteratively. The following lemma ensures that stillcomplies privacy requirements.

Lemma 1. , 1

5. Specialize according to , Output .

, ,

, , , ,

The basic idea of TPTDS is to gain high scalability bymaking a tradeoff between scalability and data utility. Weexpect that slight decrease of data utility can lead to highscalability. The influence of and on the data utility isanalyzed as follows. The data utility produced via TPTDS isroughly determined by 2. Greatermeans that the specializations in are selected according to IGPL valuesfrom smaller data sets, resulting in exposing less datautility. However, greateralso implies smallerbutlarger2, which means more data utility can be producedbecause specializations in2are selected according anentire data set. Larger indicates larger 2, generatingmore data utility.

B. Data Partition

When is partitioned into , 1 , it is required that thedistribution of data records in is similar to . Adata recordhere can be treated as a point in an m-dimension space,where m is the number of attributes. Thus, the intermediateanonymization levels derived from , 1 , can be moresimilar so that we can get a better merged anonymizationlevel. Random sampling technique is adopted to partition, which can satisfy the above requirement.Specifically, arandom number, 1

, is generated for eachdata record. A record is assigned to the partition .Algorithm 2 shows the MapReduce program of

1 2

Ourapproach can ensure the degree of data privacy preservation,as TPTDS produces k-anonymous data sets finally.Lemma 1 ensures that the first phase produces conistentanonymous data sets that satisfy higher degree of privacypreservation than users specification. Then, MRTDS canfurther anonymize the entire data sets to produce final k- anonymousdata sets in the second phase.

D. Data Specialization

An original data set is is concretely specialized foranonymization in a one-pass MapReduce job. After obtainingthe merged intermediate anonymization level , we run (, , )on the entire data set, and get thefinal anonymization level. Then, the data setisanonymized by replacing original attribute values in withthe responding domain values in.

Details of Map and Reduce functions of the dataspecialization MapReduce job are described in Algorithm3. The Map function emits anonymous records and itscount. The Reduce function simply aggregates these anonymousrecords and counts their number. An anonymousrecord and its count represent a QI-group. The QI- groupsconstitute the final anonymous data sets.

Algorithm 3. DATA SPECIALIZATION MAP & REDUCE

datapartition. Note that the number of Reducers should be

equalto, so that each Reducer handles one value of, exactlyproducingresultant files. Each file contains a randomsample of.

Algorithm 2. DATA PARTITION MAP & REDUCE

Input: Data record , , , partition parameter .

Input: Data record , , ; Anonymization level . Output: Anonymous Record (, ).

Map: Construct anonymous record

= 1, 2, , , , , 1 , is the parent of a specialization in current. and is also an ancestor of in ; emit (, ).

Reduce: For each , ; emit (, ).

MAP REDUCE VERSION OF CENTRALIZED TDS

We elaborate the MRTDS in this section. MRTDS plays acore role in the two-phase TDS approach, as it is invoked inboth phases to concretely conduct computation. Basically, apractical MapReduce program consists of and

functions, and athat coordinates the macro executionof jobs.

A.MRTDS Driver

Usually, a single MapReduce job is inadequate toaccomplisha complex task in many applications. Thus, a group ofMapReduce jobs are orchestrated in a driver program toachieve such an objective. MRTDS consists of and two types of jobs, i.e., and . The driver arranges the execution of jobs.

Algorithm 4 frames where a data set

continues until allspecializations become invalid, achieving the maximumdata utility.

MRTDS produces the same anonymous data as thecentralized TDS in [12], because they follow the same steps.MTRDS mainly differs from centralized TDS on calculatingIGPL values. However, calculating IGPL values dominatesthe scalability of TDS approaches, as it requires TDSalgorithms to count the statistical information of data setsiteratively.

IGPL Initialization Job

The Map andReducefunctions of the job are described in Algorithms 5 and 6,respectively. The maintask of is toinitialize information gainand privacy loss of all specializations in the initialanonymization level. According to (2) and (3), thestatistical information , , , | | and

, isrequired for each specialization to calculate informationgain.

Algorithm 5. IGPL INITIALIZATION MAP

isanonymized by TDS. It is the algorithmic design of function(, , ) Notethat we leverage

anonymization level to manage the processof anonymization. Step 1 initializes the values of informationgain and privacy loss for all specializations, which canbe done by the job .

Algorithm 4. MRTDS DRIVER

Input: Data set anonymization level and k-anonymity parameter .

Output: Anonymization level .

Initialize the values of search metric IGPL, i.e., for each specialization . The IGPL value of

Input: Data record , , ; anonymization level . Output: Intermediate key-value pair (, ).

For each attribute value in , find its specialization in current : . Let be the parent in and be the schild value that is also an ancestor of in

.

2. For each , emit ( , , , ).

3. Construct quasi-identifier = 1, 2, , , where , 1 , is the parent of a specialization in current . Emit ( , $, # , ).

is computed by

=1

.

4. For each [1, ], replace in with its child

is also ancestor of . Let the resultant quasi-identifier

=1

2. While is valid

2.1. Find the best specialization from ,

be . Emit ( , $, # , ).

2.2. Update to +1

Algorithm 5 describes the function. The input isdata sets that consist of a number of records. is thesequence

2.3. Update information gain of the new specializations in

+1, and privacy loss for each specialization via job .

end while

Step 2 is iterative. First, the best specialization is selected from valid specializations in current anonymization level as described in Step 2.1. A specializationis a valid oneif it satisfies two conditions. One is that its parent value isnot a leaf, and the other is that the anonymity > , i.e., the data set is still k-anonymous ifis performed.Then, the current anonymization level is modified viaperforming the best specialization in Step 2.2, i.e., removingthe old specialization and inserting new ones that arederived from the old one. In Step 2.3, information gain ofthe newly added specializations and privacy loss of allspecializations need to be recomputed, which are accomplishedby job . The iteration

number of the record. Steps 1 and 2 are tocompute , , , | |and , . Step 1 gets thepotential specialization for the attribute values in. ThenStep 2 emits key-value pairs containing the information ofspecialization, sensitive value, and the count information ofthis record. According to the above information, wecompute information gain for a potential specialization inthe corresponding function. Step 3 aims at computingthe current anonymity (), while Step 4 is tocompute anonymity()after potential specializations.The symbol # is used to identify whether a key is emittedto compute information gain or anonymity loss, while thesymbol $ is employed to differentiate the cases whether akey is for computing ()or ().

Algorithm 6 specifies thefunction. The first step isto accumulate the values for each input key. If a key is forcomputing information gain, then the corresponding statisticalinformation is updated in Step 2.1. , (), and

()are calculated if all the count information they needhas been computed in Steps 2.2 and 2.3 in terms of (2) and

(3).A salient MapReduce feature that intermediate key- valuepairs are sorted in the shuffle phase makes the computationof()sequential with respect to the order of specializationsarriving at the same reducer. Hence, the reducer justneeds to keep statistical information for one specialization ata time, which makes the reduce algorithm highly scalable.

Algorithm 6. IGPL INITIALIZATION REDUCE.

Input: Intermediate key-value pair , ().

Output: Information gain , () and anonymity

(, ()), (, ()) for all specialization.

1. For each key, .

For each , if . #, update statistical counts:

2.1. (, ) , + ||,

( , ) + ( , ) , +

.

2.2. If all sensitive values for child have arrived, compute according to 3.

For each , if . = #, update anonymity.

3.1. If . = $and < (), update current anonymity: () .

3.2. If . $and < (), update potential anonymity of : () .

4. Emit (, ()) and emit (, ()).

Fig. 1. Execution framework overview of MRTDS

Algorithm 7. IGPL UPDATE MAP

Input: Data Record ( , ), ; Anonymization Level . Output: Intermediate key-value pair (, ).

Let be the attribute of the last best specialization. The value of this attribute in is . Find its specialization in : . Let be the parent in

, and be s child that is also an ancestor of ; Emit ( , , , ).

Construct quasi identifier

= 1, 2, , , , 1 , is the parent of

To compute the anonymity of data sets before and after aspecialization, Step 3.1 finds the smallest number of

a specialization in current and is also an ancestor of in .

For each [1, ], replace in with its child

is also the ancestor of in .

recordsout of all current QI-groups, and Step 3.2 finds all thesmallest number of records out of all potential QI-groupsfor

each specialization. Step 4 emits the results ofanonymity. Note that there may be more than one keyvaluepair(, ())for one specialization in outputfiles if more than one reducer is set. But we can find thesmallest anonymity value in the driver program. Then interms of (4), the privacy loss()is computed. Finally,()for each specialization is obtained by (1).

IGPL Update Job

The IGPL Updatejob dominates the scalability and efficiencyof MRTDS, since it is executed iteratively as described inAlgorithm 4. So far, iterative MapReduce jobs have not beenwell supported by standard MapReduce framework likeHadoop [30]. Accordingly, Hadoop variations like Haloop[31] and Twister [32] have been proposed recently tosupport efficient iterative MapReduce computation. Ourapproach is based on the standard MapReduce frameworkto facilitate the discussion herein.

The jobis quite similar to , except that it requires less computation and consumes lessnetwork bandwidth. Thus, the former is more efficient thanthe latter. Algorithm 7 describes thefunction of . The function is same as IGPL Initialization, already described in Algorithm 3.

After a specializationis selected as the bestcandidate, it is required to compute the information gainfor the new specializations derived from. So, Step 1 inAlgorithm 7 only emits the key-value pairs for the newspecializations, rather than all in Algorithm 5. Note that it isunnecessary to compute the information gain of otherspecializations because conducting the selected specializationnever affects the information gain of others. Comparedwith , only a part of data is processed andless network bandwidth is consumed.

Weneed to compute ()for all specializations in, described in Step 2 and 3 of Algorithm 7. Yet ()canbe directly obtained from the statistical information kept bythe last best specialization. Note that if the specializationrelated to in Step 3 is not valid, no resultant quasi-identifierwill be created.

Implementation and Optimization

To elaborate how data sets are processed in MRTDS, theexecution framework based on standard MapReduce isdepicted in Fig. 1. The solid arrow lines represent the dataflows in the canonical MapReduce framework. From Fig. 1,we can see that the iteration of MapReduce jobs iscontrolled by anonymization levelin . The dataflows for handling iterations are denoted by dashed arrowlines.is

dispatched from to all workers including and

via the distributed cache mechanism.The value ofis modified inaccording to theoutput of the and jobs. As theamount of such data is extremely small compared with datasets that will be anonymized, they can be efficientlytransmitted between and workers.

We adopt Hadoop [30], an open-source implementationof MapReduce, to implement MRTDS. Since most of and

functions need to access current anonymizationlevel. we use the distributed cache mechanism to passthe content of to each or

node as shownin Fig. 1.

CONCLUSIONS AND FUTUREWORKS

In this paper, we have investigated the scalability problemof large scale data anonymization by TDS, and proposed ahighly scalable two-phase TDS approach using MapReduceon cloud. Here we use highly efficient and highly optimized ARX anonymization tools libraries for k anonymity and Top Down Specialization. Data sets are partitioned and anonymized inparallel in the first phase, producing intermediate results.Then, the intermediate results are merged and furtheranonymized to produce consistent k-anonymous data setsin the second phase. We have integrated anonymization algorithms to fit into MapReduce framework to achieve scalability. Experimental results show that the scalability and efficiency of TDS are improved significantlyover existing approaches.

In cloud environment, the privacy preservation for dataanalysis. Sharing and mining is a challenging research issuedue to increasingly larger volumes of data sets, therebyrequiring intensive investigation. We will investigate theadoption of our approach with ,

for data anonymization.

REFERENCES

S. Chaudhuri, What Next?: A Half-Dozen Data ManagementResearch Goals for Big Data and the Cloud, Proc. 31st Symp.Principles of Database Systems (PODS 12), pp. 1-4, 2012.

M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A.Konwinski,

G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M.Zaharia, A View of Cloud Computing, Comm. ACM, vol. 53,no. 4, pp. 50-58, 2010.

L. Wang, J. Zhan, W. Shi, and Y. Liang, In Cloud, Can ScientificCommunities Benefit from the Economies of Scale?, IEEE Trans.Parallel and Distributed Systems, vol. 23, no. 2, pp.296-303, Feb.2012.

H. Takabi, J.B.D. Joshi, and G. Ahn, Security and PrivacyChallenges in Cloud Computing Environments, IEEE Securityand Privacy, vol. 8, no. 6, pp. 24-31, Nov. 2010.

D. Zissis and D. Lekkas, Addressing Cloud Computing SecurityIssues, Future Generation Computer Systems, vol. 28, no. 3, pp. 83-592, 2011.

X. Zhang, C. Liu, S. Nepal, S. Pandey, and J. Chen, A PrivacyLeakage Upper-Bound Constraint Based Approach for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud,IEEE Trans. Parallel and Distributed Systems, to be published, 2012.

L. Hsiao-Ying and W.G. Tzeng, A Secure Erasure Code-BasedCloud Storage System with Secure Data Forwarding, IEEE Trans.Parallel and Distributed Systems, vol. 23, no. 6, pp. 995-1003, 2012.

N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, Privacy- PreservingMulti-Keyword Ranked Search over Encrypted Cloud Data, Proc.IEEE INFOCOM, pp. 829-837, 2011.

P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler, Gupt:Privacy Preserving Data Analysis Made Easy, Proc. ACMSIGMOD Intl Conf. Management of Data (SIGMOD 12), pp. 349-360, 2012.

Microsoft HealthVault,

http://www.microsoft.com/health/ww/products/Pages/healthvault.aspx, 2013.

B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, Privacy-PreservingData Publishing: A Survey of Recent Devel- opments, ACMComputing Surveys, vol. 42, no. 4, pp. 1-53, 2010.

B.C.M. Fung, K. Wang, and P.S. Yu, Anonymizing ClassificationData for Privacy Preservation, IEEE Trans. Knowledge and DataEng., vol. 19, no. 5, pp. 711-725, May 2007.

X. Xiao and Y. Tao, Anatomy: Simple and Effective PrivacyPreservation, Proc. 32nd Intl Conf. Very Large Data Bases (VLDB06), pp. 139-150, 2006.

K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, Incognito:Efficient Full-Domain K-Anonymity, Proc ACM SIGMOD IntlConf. Management of Data (SIGMOD 05), pp. 49-60, 2005.

K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, MondrianMultidimensional K-Anonymity, Proc. 22nd Intl Conf. DataEng. (ICDE 06), 2006.

V. Borkar, M.J. Carey, and C. Li, Inside Big Data Management:Ogres, Onions, or Parfaits?, Proc. 15th Intl Conf. ExtendingDatabase Technology (EDBT 12), pp. 3-14, 2012.

K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, Workload- AwareAnonymization Techniques for Large-Scale Data Sets, ACMTrans. Database Systems, vol. 33, no. 3, pp. 1-47, 2008.

T. Iwuchukwu and J.F. Naughton, K-Anonymization as SpatialIndexing: Toward Scalable and Incremental Anonymization,Proc. 33rd Intl Conf. Very Large Data Bases (VLDB

07), pp. 746-757,2007.

J. Dean and S. Ghemawat, Mapreduce: Simplified Data Processingon Large Clusters, Comm. ACM, vol. 51, no. 1, pp. 107-113,2008.

N. Mohammed, B. Fung, P.C.K. Hung, and C.K. Lee, Centralizedand Distributed Anonymization for High-Dimensional HealthcareData, ACM Trans. Knowledge Discovery from Data, vol. 4, no. 4,Article 18, 2010.

B. Fung, K. Wang, L. Wang, and P.C.K. Hung, Privacy-Preserving Data Publishing for Cluster Analysis, Data andKnowledge Eng., vol. 68, no. 6, pp. 552-575, 2009.

N. Mohammed, B.C. Fung, and M. Debbabi, Anonymity MeetsGame Theory: Secure Data Integration with Malicious Participants,VLDB J., vol. 20, no. 4, pp. 567-588, 2011.

L. Sweeney, k-Anonymity: A Model for Protecting Privacy, IntlJ. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5,pp. 557-570, 2002.

W. Jiang and C. Clifton, A Secure Distributed Framework forAchieving k-Anonymity, VLDB J., vol. 15, no. 4, pp. 316-333,2006.

P. Jurczyk and L. Xiong, Distributed Anonymization: AchievingPrivacy for Both Data Subjects and Data Providers, Proc. 23rdAnn. IFIP WG 11.3 Working Conf. Data and Applications SecurityXXIII (DBSec 09), pp. 191-207, 2009.

I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel,Airavat: Security and Privacy for Mapreduce, Proc. SeventhUSENIX Conf. Networked Systems Design and Implementation (NSDI10), pp. 297- 312, 2010.

K. Zhang, X. Zhou, Y. Chen, X. Wang, and Y. Ruan, Sedic:Privacy- Aware Data Intensive Computing on Hybrid Clouds,Proc. 18th ACM Conf. Computer and Comm. Security (CCS 11),pp. 515-526, 2011.

X. Xiao and Y. Tao, Personalized Privacy Preservation, Proc.ACM SIGMOD Intl Conf. Management of Data (SIGMOD 06),pp. 229-240, 2006.

Amazon Web Services, Amazon Elastic Mapreduce, http://aws.amazon.com/elasticmapreduce/, 2013.

Apache, Hadoop,http://hadoop.apache.org, 2013.

Y. Bu, B. Howe, M. Balazinska, and M.D. Ernst, The HaloopApproach to Large-Scale Iterative Data Analysis, VLDB J., vol. 21,no. 2, pp. 169- 190, 2012.

J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu,and G. Fox, Twister: A Runtime for Iterative Mapreduce, Proc.19th ACM Intl Symp. High Performance Distributed Computing(HDPC 10), pp. 810-818, 2010.

UCI Machine Learning Repository, ftp://ftp.ics.uci.edu/pub/ machine- learning-databases/,2013.

ARX powerful data anonymization, http://arx.deidentifier.org/