 Open Access
 Authors : Sampada U. Nagmote, Prof. P. P. Deshmukh
 Paper ID : IJERTV2IS111007
 Volume & Issue : Volume 02, Issue 11 (November 2013)
 Published (First Online): 30-11-2013
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Review on Discrimination Prevention Using both Direct and Indirect Method in Data Mining
Sampada U. Nagmote
P.R.Patil College of Engg & Technology, Amravati
Prof. P. P. Deshmukh
P.R.Patil College of Engg & Technology, Amravati
Abstract
Data mining is an increasingly important technology for extracting useful knowledge from large collections of data. It also raises concerns, chief among them potential privacy invasion and potential discrimination. Discrimination is the unequal or inferior treatment of persons on the basis of certain characteristics, that is, treating people unfairly because they belong to a specific group. If data sets are divided on the basis of sensitive attributes such as gender, race, or religion, discriminatory decisions may ensue. For this reason, antidiscrimination techniques for discrimination prevention have been introduced in data mining. Discrimination can be either direct or indirect. Direct discrimination occurs when decisions are made based on sensitive attributes. It consists of rules or procedures that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Indirect discrimination occurs when decisions are made based on nonsensitive attributes that are strongly correlated with sensitive ones. It consists of rules or procedures that, while not explicitly mentioning discriminatory attributes, could intentionally or unintentionally generate discriminatory decisions. In this paper, we review the basic definitions of direct and indirect discrimination, discrimination prevention in data mining, and rule protection.

Introduction
Discrimination is the act of unfairly treating people on the basis of their belonging to a specific group. For instance, individuals may be discriminated against because of their race, gender, etc. [5]. Put differently, it is the treatment of an individual based on their membership in a particular category or group. Various laws prohibit discrimination on the basis of attributes such as race, religion, nationality, disability, and age.
There are two types of discrimination: direct and indirect. Direct discrimination consists of rules or procedures that explicitly mention minority or disadvantaged groups based on sensitive attributes related to group membership. Indirect discrimination consists of rules or procedures that do not explicitly mention discriminatory attributes but nevertheless generate discriminatory decisions, intentionally or unintentionally. [1]
This paper discusses the following points:

Process for Discrimination Discovery

Basic Definitions

Direct Discrimination Prevention Method

Indirect Discrimination Prevention Method


Process for Discrimination Discovery
Discrimination discovery is the process of finding discriminatory decisions hidden in a dataset. The basic problem in discrimination analysis is, given a dataset, to quantify the degree of discrimination suffered by a given group in a given context with respect to the classification decision.
Figure 1 shows the process of discrimination discovery, based on approaches and measures described in this section. [5]
[Fig 1. Process of Discrimination Discovery. The figure is not reproduced here; its boxes read: Original Dataset; Classification Rule Extraction; Frequent Classification Rules; Predetermined Discriminatory and Nondiscriminatory Items in DB; Categorize Frequent Rules into PD and PND Rules; PD and PND Classification Rules; Compare for Each PD Classification Rule; Compare the Computed Value with the Discriminatory Threshold; Discrimination Rules.]

Basic Definitions

An item is an attribute together with its value, e.g., {Gender = Female}.

Association/classification rule mining attempts, given a set of transactions (records), to predict the occurrence of an item based on the occurrences of other items in the transaction.

An item set is a set of one or more items, e.g., {Gender = Male, Zip = 54341}.

A classification rule is an expression X → C, where X is an item set containing no class items and C is a class item, e.g., {Gender = Female, Zip = 54341} → Intruder = YES. X is called the premise (or the body) of the rule.

The support of an item set, supp(X), is the fraction of records that contain the item set X. A rule X → C is completely supported by a record if both X and C appear in the record.

The confidence of a classification rule, conf(X → C), measures how often the class item C appears in records that contain X. Hence, if supp(X) > 0,

$$\mathrm{conf}(X \rightarrow C) = \frac{\mathrm{supp}(X, C)}{\mathrm{supp}(X)} \qquad (1)$$

Support and confidence range over [0, 1].

The notation also extends to negated item sets, i.e., ¬X.

A frequent classification rule is a classification rule with support or confidence greater than a specified lower bound. Let DB be a database of original data records and FRs be the database of frequent classification rules. [5]

Direct Discrimination Prevention

Definition of Direct Discrimination
Definition 1. Let A, B → C be a classification rule such that conf(B → C) > 0. The extended lift of the rule is

$$\mathrm{elift}(A, B \rightarrow C) = \frac{\mathrm{conf}(A, B \rightarrow C)}{\mathrm{conf}(B \rightarrow C)} \qquad (2)$$

The idea is to evaluate the discrimination of a rule as the gain in confidence obtained by adding the discriminatory item set A to the premise of the base rule B → C.
Definition 2. Let α ∈ R be a fixed threshold and let A be a discriminatory item set. A PD (potentially discriminatory) classification rule c = A, B → C is α-protective w.r.t. elift if elift(c) < α. Otherwise, c is α-discriminatory. The aim of direct discrimination discovery is to identify α-discriminatory rules. [12][13]
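As a minimal sketch of the measures defined above, the following Python code computes support, confidence, and extended lift on a small hypothetical dataset and tests whether a rule is α-protective. The records, the item names (such as Credit=No), the function names, and the threshold α = 1.2 are all assumptions made for illustration; none of them come from the reviewed papers.

```python
from fractions import Fraction

# Hypothetical toy records: each record is a set of "attribute=value" items.
records = (
    [{"Gender=Female", "Zip=54341", "Credit=No"}] * 3
    + [{"Gender=Female", "Zip=54341", "Credit=Yes"}] * 1
    + [{"Gender=Male", "Zip=54341", "Credit=No"}] * 1
    + [{"Gender=Male", "Zip=54341", "Credit=Yes"}] * 3
    + [{"Gender=Male", "Zip=10001", "Credit=Yes"}] * 2
)


def supp(itemset, records):
    """Fraction of records that contain every item of `itemset`."""
    items = set(itemset)
    return Fraction(sum(1 for r in records if items <= r), len(records))


def conf(premise, class_item, records):
    """conf(X -> C) = supp(X, C) / supp(X), defined only if supp(X) > 0."""
    s = supp(premise, records)
    if s == 0:
        raise ValueError("premise has zero support")
    return supp(set(premise) | {class_item}, records) / s


def elift(a, b, class_item, records):
    """elift(A, B -> C) = conf(A, B -> C) / conf(B -> C)."""
    return conf(set(a) | set(b), class_item, records) / conf(b, class_item, records)


def is_alpha_protective(a, b, class_item, records, alpha):
    """A PD rule A, B -> C is alpha-protective if elift < alpha."""
    return elift(a, b, class_item, records) < alpha


A = {"Gender=Female"}   # discriminatory item set (assumed)
B = {"Zip=54341"}       # context item set (assumed)
C = "Credit=No"         # class item (assumed)
alpha = Fraction(6, 5)  # threshold 1.2 (assumed)

print(conf(A | B, C, records))                        # 3/4
print(conf(B, C, records))                            # 1/2
print(elift(A, B, C, records))                        # 3/2
print(is_alpha_protective(A, B, C, records, alpha))   # False
```

Exact fractions are used instead of floats so that comparisons against the threshold are not affected by rounding.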


Direct Rule Protection
In order to convert each α-discriminatory rule into an α-protective rule, based on the direct discrimination measure elift (i.e., Definition 2), the following inequality should hold for each α-discriminatory rule r': A, B → C in MR, the database of α-discriminatory rules, where A is a discriminatory item set:

$$\mathrm{elift}(r') < \alpha \qquad (3)$$

By the definition of elift, this inequality can be rewritten as

$$\frac{\mathrm{conf}(r': A, B \rightarrow C)}{\mathrm{conf}(B \rightarrow C)} < \alpha \qquad (4)$$

or, equivalently,

$$\mathrm{conf}(r': A, B \rightarrow C) < \alpha \cdot \mathrm{conf}(B \rightarrow C) \qquad (5)$$

Inequality (3) can therefore be satisfied by decreasing the confidence of the α-discriminatory rule r': A, B → C to a value less than the right-hand side of Inequality (5), without affecting the confidence of its base rule B → C. A possible solution for decreasing

$$\mathrm{conf}(r': A, B \rightarrow C) = \frac{\mathrm{supp}(A, B, C)}{\mathrm{supp}(A, B)} \qquad (6)$$

is to perturb the discriminatory item set from ¬A to A in the subset DBc of all records of the original data set that completely support the rule ¬A, B → ¬C and have minimum impact on other rules. Doing so increases the denominator of Expression (6) while keeping the numerator and conf(B → C) unaltered.
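The transformation that lowers conf(r': A, B → C) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each record is reduced to three hypothetical boolean fields A, B, and C (membership of the discriminatory item set, of the context, and of the class item), candidate records are flipped in list order rather than ranked by their impact on other rules, and the toy data and the threshold α = 1.2 are invented.

```python
from fractions import Fraction


def conf(records, premise, conclusion):
    """Confidence of premise -> conclusion; both are dicts of field values."""
    covered = [r for r in records if all(r[k] == v for k, v in premise.items())]
    if not covered:
        raise ValueError("premise has zero support")
    hits = sum(all(r[k] == v for k, v in conclusion.items()) for r in covered)
    return Fraction(hits, len(covered))


def protect_by_flipping_a(records, alpha):
    """Flip A from False to True in records supporting (not A, B -> not C)
    until conf(A, B -> C) < alpha * conf(B -> C). Returns a modified copy."""
    out = [dict(r) for r in records]
    # conf(B -> C) does not depend on A, so the target stays fixed.
    target = alpha * conf(out, {"B": True}, {"C": True})
    for r in out:
        if conf(out, {"A": True, "B": True}, {"C": True}) < target:
            break
        if not r["A"] and r["B"] and not r["C"]:
            r["A"] = True
    return out


def make(n, **fields):
    """Build n identical hypothetical records."""
    return [dict(fields) for _ in range(n)]


records = (
    make(3, A=True, B=True, C=True)
    + make(1, A=True, B=True, C=False)
    + make(1, A=False, B=True, C=True)
    + make(3, A=False, B=True, C=False)
    + make(2, A=False, B=False, C=False)
)
alpha = Fraction(6, 5)

protected = protect_by_flipping_a(records, alpha)
rule = ({"A": True, "B": True}, {"C": True})
base = ({"B": True}, {"C": True})
print(conf(records, *rule) / conf(records, *base))      # 3/2
print(conf(protected, *rule) / conf(protected, *base))  # 1
```

In this toy run two records are flipped: supp(A, B) grows from 4 to 6 records while supp(A, B, C) and conf(B → C) stay the same, so elift drops from 1.5 to 1. The function does not report failure if too few candidate records exist; a caller would need to re-check Inequality (5) on the result.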
Inequality (4) can also be rewritten in a different way, which suggests a second method for direct rule protection:

$$\mathrm{conf}(B \rightarrow C) > \frac{\mathrm{conf}(r': A, B \rightarrow C)}{\alpha} \qquad (7)$$

Inequality (3) can thus be satisfied by increasing the confidence of the base rule B → C of the α-discriminatory rule r': A, B → C to a value higher than the right-hand side of Inequality (7), without affecting the value of conf(r': A, B → C). A possible solution for increasing

$$\mathrm{conf}(B \rightarrow C) = \frac{\mathrm{supp}(B, C)}{\mathrm{supp}(B)} \qquad (8)$$

is to perturb the class item from ¬C to C in the subset DBc of all records of the original data set that completely support the rule ¬A, B → ¬C and have minimum impact on other rules. Doing so increases the numerator of Expression (8) while keeping the denominator and conf(r': A, B → C) unaltered.
Similar data transformation methods could be applied to obtain direct rule protection with respect to other measures (i.e., slift and olift). [1] Algorithms based on this rule protection are given in [1].


Indirect Discrimination Prevention

Definition of Indirect Discrimination
Definition 3. A PND (potentially nondiscriminatory) classification rule r: X → C with premise X = D, B is a redlining rule if it could yield an α-discriminatory rule r': A, B → C in combination with currently available background knowledge rules of the form rb1: A, B → D and rb2: D, B → A, where A is a discriminatory item set.
Definition 4. A PND classification rule r: X → C with premise X = D, B is a nonredlining rule if it cannot yield any α-discriminatory rule r': A, B → C in combination with currently available background knowledge rules of the form rb1: A, B → D and rb2: D, B → A, where A is a discriminatory item set.
The correlation between the discriminatory item set A and the nondiscriminatory item set D in context B, indicated by the background rules rb1 and rb2, holds with confidences of at least β1 and β2, respectively; it is not, however, a completely certain correlation.
Let RR be the database of redlining rules extracted from database DB. [6]

Indirect Rule Protection
In order to turn a redlining rule into a nonredlining rule, based on the indirect discrimination measure elb, the following inequality should be enforced for each redlining rule r: D, B → C in RR:

$$\mathrm{elb}(\gamma, \delta) < \alpha \qquad (9)$$

where γ = conf(r: D, B → C) and δ = conf(B → C). Expanding elb, Inequality (9) can be rewritten as

$$\frac{\dfrac{\mathrm{conf}(r_{b1})}{\mathrm{conf}(r_{b2})}\,\bigl(\mathrm{conf}(r_{b2}) + \mathrm{conf}(r: D, B \rightarrow C) - 1\bigr)}{\mathrm{conf}(B \rightarrow C)} < \alpha \qquad (10)$$

The discriminatory item set (i.e., A) is not removed from the original database DB, and the rules rb1: A, B → D and rb2: D, B → A are obtained from DB, so their confidences might change as a result of the data transformation for indirect discrimination prevention.
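The left-hand side of Inequality (10) can be evaluated directly once the four confidences are known. The sketch below does so in Python; the function names, the input values, and α = 1.2 are hypothetical, and clamping a non-positive numerator to zero is an assumption of this sketch rather than something stated in this review.

```python
from fractions import Fraction


def elb(conf_rb1, conf_rb2, conf_r, conf_base):
    """Left-hand side of Inequality (10).

    conf_rb1  = conf(rb1: A, B -> D)
    conf_rb2  = conf(rb2: D, B -> A), must be > 0
    conf_r    = conf(r: D, B -> C)
    conf_base = conf(B -> C), must be > 0
    """
    if conf_rb2 <= 0 or conf_base <= 0:
        raise ValueError("conf(rb2) and conf(B -> C) must be positive")
    numerator = (conf_rb1 / conf_rb2) * (conf_rb2 + conf_r - 1)
    # Assumption: a non-positive numerator gives no evidence, so use 0.
    return max(numerator, 0) / conf_base


def violates_inequality_9(conf_rb1, conf_rb2, conf_r, conf_base, alpha):
    """True if Inequality (9) fails, i.e. the rule r needs protection."""
    return elb(conf_rb1, conf_rb2, conf_r, conf_base) >= alpha


# Hypothetical confidences, not taken from any real dataset.
value = elb(Fraction(3, 4), Fraction(4, 5), Fraction(9, 10), Fraction(1, 2))
print(value)  # 21/16
print(violates_inequality_9(Fraction(3, 4), Fraction(4, 5),
                            Fraction(9, 10), Fraction(1, 2),
                            Fraction(6, 5)))  # True
```

With these inputs the value is 21/16 ≈ 1.31, which is not below α = 1.2, so the rule would have to be transformed.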
Inequality (10) can be rewritten as

$$\mathrm{conf}(r_{b1}: A, B \rightarrow D) < \frac{\alpha \cdot \mathrm{conf}(B \rightarrow C) \cdot \mathrm{conf}(r_{b2})}{\mathrm{conf}(r_{b2}) + \mathrm{conf}(r: D, B \rightarrow C) - 1} \qquad (11)$$

In this case, Inequality (9) can be satisfied by decreasing the confidence of rule rb1: A, B → D to a value less than the right-hand side of Inequality (11) without affecting the confidence of the redlining rule or the confidences of the B → C and rb2 rules. Since the values of both sides of the inequality are dependent, a transformation is required that decreases the left-hand side without any impact on the right-hand side. A possible solution for decreasing

$$\mathrm{conf}(r_{b1}: A, B \rightarrow D) = \frac{\mathrm{supp}(A, B, D)}{\mathrm{supp}(A, B)} \qquad (12)$$

in Inequality (11) to the target value is to perturb the discriminatory item set from ¬A to A in the subset DBc of all records of the original data set that completely support the rule ¬A, B, ¬D → ¬C and have minimum impact on other rules. This increases the denominator of Expression (12) while keeping the numerator and conf(B → C), conf(rb2: D, B → A), and conf(r: D, B → C) unaltered.
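A short worked example may make the target value concrete. All numbers are hypothetical: α = 1.2, conf(B → C) = 0.5, conf(rb2) = 0.8, conf(r: D, B → C) = 0.9, and rb1 is supported by 30 of the 40 records that contain A and B. The symbol k (an assumption of this example) denotes the number of records perturbed from ¬A to A, each of which adds one record to the denominator of Expression (12).

```latex
\begin{align*}
\text{right-hand side of (11)} &= \frac{1.2 \times 0.5 \times 0.8}{0.8 + 0.9 - 1}
  = \frac{0.48}{0.7} \approx 0.686 \\
\mathrm{conf}(r_{b1}) = \frac{30}{40} &= 0.75 > 0.686
  \quad \text{(Inequality (11) is violated)} \\
\frac{30}{40 + k} < \frac{0.48}{0.7} &\iff k > 3.75
  \quad \Rightarrow \quad k = 4 \\
\mathrm{conf}(r_{b1}) = \frac{30}{44} &\approx 0.682 < 0.686
  \quad \text{(Inequality (11) is satisfied)}
\end{align*}
```

Under these assumptions, perturbing four records that support ¬A, B, ¬D → ¬C is enough, provided at least four such records exist in the data set.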
There is another way to provide indirect rule protection. Inequality (10) can be rewritten as Inequality (13), where the confidences of the rb1 and rb2 rules are not constant:

$$\mathrm{conf}(B \rightarrow C) > \frac{\dfrac{\mathrm{conf}(r_{b1})}{\mathrm{conf}(r_{b2})}\,\bigl(\mathrm{conf}(r_{b2}) + \mathrm{conf}(r: D, B \rightarrow C) - 1\bigr)}{\alpha} \qquad (13)$$

In this case, Inequality (9) can be satisfied by increasing the confidence of the base rule B → C of the redlining rule r: D, B → C to a value greater than the right-hand side of Inequality (13) without affecting the confidence of the redlining rule or the confidences of the rb1 and rb2 rules. A possible solution for increasing Expression (8) in Inequality (13) to the target value is to perturb the class item from ¬C to C in the subset DBc of all records of the original data set that completely support the rule ¬A, B, ¬D → ¬C and have minimum impact on other rules. This increases the numerator of Expression (8) while keeping the denominator and conf(rb1: A, B → D), conf(rb2: D, B → A), and conf(r: D, B → C) unaltered.
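The class-item perturbation just described can be sketched in Python on a toy dataset. Everything here is an assumption made for illustration: records are reduced to four hypothetical boolean fields A, B, C, and D, candidates are flipped in list order with no ranking by impact on other rules, conf(rb2) is assumed to be positive, and the data and α = 1.2 are invented.

```python
from fractions import Fraction


def conf(records, premise, conclusion):
    """Confidence of premise -> conclusion; both are dicts of field values."""
    covered = [r for r in records if all(r[k] == v for k, v in premise.items())]
    if not covered:
        raise ValueError("premise has zero support")
    hits = sum(all(r[k] == v for k, v in conclusion.items()) for r in covered)
    return Fraction(hits, len(covered))


def rhs_13(records, alpha):
    """Right-hand side of Inequality (13), computed from the data."""
    rb1 = conf(records, {"A": True, "B": True}, {"D": True})
    rb2 = conf(records, {"D": True, "B": True}, {"A": True})
    r = conf(records, {"D": True, "B": True}, {"C": True})
    return (rb1 / rb2) * (rb2 + r - 1) / alpha


def protect_by_flipping_class(records, alpha):
    """Flip C from False to True in records supporting
    (not A, B, not D -> not C) until conf(B -> C) exceeds the bound."""
    out = [dict(r) for r in records]
    bound = rhs_13(out, alpha)  # unaffected by the flips performed below
    for r in out:
        if conf(out, {"B": True}, {"C": True}) > bound:
            break
        if r["B"] and not r["A"] and not r["D"] and not r["C"]:
            r["C"] = True
    return out


def make(n, **fields):
    """Build n identical hypothetical records."""
    return [dict(fields) for _ in range(n)]


records = (
    make(6, A=True, B=True, D=True, C=True)
    + make(1, A=True, B=True, D=True, C=False)
    + make(1, A=True, B=True, D=False, C=False)
    + make(1, A=False, B=True, D=True, C=True)
    + make(1, A=False, B=True, D=False, C=True)
    + make(10, A=False, B=True, D=False, C=False)
)
alpha = Fraction(6, 5)

protected = protect_by_flipping_class(records, alpha)
print(rhs_13(records, alpha))                         # 5/8
print(conf(records, {"B": True}, {"C": True}))        # 2/5
print(conf(protected, {"B": True}, {"C": True}))      # 13/20
```

In this toy run five records are flipped, raising conf(B → C) from 0.4 to 0.65, just above the bound of 0.625. The confidences of rb1, rb2, and r are unchanged because every flipped record lacks both A and D.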


Conclusion
In sociology, discrimination is the prejudicial treatment of an individual based on their membership in a certain group or category. It involves denying members of one group opportunities that are available to other groups. Like privacy invasion, discrimination could have a negative social impact on the acceptance and dissemination of data mining technology, and it is an important issue from both the legal and the ethical perspective. To prevent both direct and indirect discrimination in a dataset, the first step is to discover whether direct or indirect discrimination exists. If any discrimination is found, the dataset is modified until discrimination is brought below a certain threshold or entirely eliminated.

References

European Commission, EU Directive 2004/113/EC on Anti-discrimination, http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2004:373:0037:0043:EN:PDF, 2004.

European Commission, EU Directive 2006/54/EC on Anti-discrimination, http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2006:204:0023:0036:en:PDF, 2006.

S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, Discrimination Prevention in Data Mining for Intrusion and Crime Detection, Proc. IEEE Symp. Computational Intelligence in Cyber Security (CICS '11), pp. 47-54, 2011.

S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, Rule Protection for Indirect Discrimination Prevention in Data Mining, Proc. Eighth Int'l Conf. Modeling Decisions for Artificial Intelligence (MDAI '11), pp. 211-222, 2011.

F. Kamiran and T. Calders, Classification without Discrimination, Proc. IEEE Second Int'l Conf. Computer, Control and Comm. (IC4 '09), 2009.

F. Kamiran and T. Calders, Classification with no Discrimination by Preferential Sampling, Proc. 19th Machine Learning Conf. Belgium and The Netherlands, 2010.

F. Kamiran, T. Calders, and M. Pechenizkiy, Discrimination Aware Decision Tree Learning, Proc. IEEE Int'l Conf. Data Mining (ICDM '10), pp. 869-874, 2010.

R. Kohavi and B. Becker, UCI Repository of Machine Learning Databases, http://archive.ics.uci.edu/ml/datasets/Adult, 1996.

D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, UCI Repository of Machine Learning Databases, http://archive.ics.uci.edu/ml, 1998.

D. Pedreschi, S. Ruggieri, and F. Turini, Discrimination-Aware Data Mining, Proc. 14th ACM Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 560-568, 2008.

D. Pedreschi, S. Ruggieri, and F. Turini, Measuring Discrimination in Socially-Sensitive Decision Records, Proc. Ninth SIAM Data Mining Conf. (SDM '09), pp. 581-592, 2009.

D. Pedreschi, S. Ruggieri, and F. Turini, Integrating Induction and Deduction for Finding Evidence of Discrimination, Proc. 12th ACM Int'l Conf. Artificial Intelligence and Law (ICAIL '09), pp. 157-166, 2009.

S. Ruggieri, D. Pedreschi, and F. Turini, Data Mining for Discrimination Discovery, ACM Trans. Knowledge Discovery from Data, vol. 4, no. 2, article 9, 2010.

S. Ruggieri, D. Pedreschi, and F. Turini, DCUBE: Discrimination Discovery in Databases, Proc. ACM Int'l Conf. Management of Data (SIGMOD '10), pp. 1127-1130, 2010.

P.N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.