 Open Access
 Total Downloads : 828
 Authors : B. Hemada, K.S.Vijaya Lakshmi
 Paper ID : IJERTV2IS80599
 Volume & Issue : Volume 02, Issue 08 (August 2013)
Published (First Online): 26-08-2013
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Study On Discretization Techniques
B. Hemada,
Department of Computer Science and Engineering,
V.R. Siddhartha Engineering College (Autonomous),
Affiliated to JNTUK, Vijayawada, Andhra Pradesh, India.
K.S. Vijaya Lakshmi,
Department of Computer Science and Engineering,
V.R. Siddhartha Engineering College (Autonomous),
Affiliated to JNTUK, Vijayawada, Andhra Pradesh, India.
Abstract
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. Many algorithms can only process discrete attributes, yet real-world databases often involve continuous features, which must be discretized before such algorithms can be applied. Discretization methods transform continuous features into a finite number of intervals, where each interval is associated with a numerical discrete value. This paper analyzes existing data discretization techniques for data preprocessing. First, the importance and process of discretization are studied. Furthermore, we study the most representative and recent discretizers; it is essential to select a proper method depending on the learning environment. Finally, choosing the best discretization method for association analysis is proposed as future research.
Keywords: Discretization, Continuous data, Data Mining, Classification.

Introduction.
Data mining can be defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Even though the modeling phase is the core of the process, the quality of the results relies heavily on data preparation, which usually takes around 80% of the total time. An interesting method for data preparation is to discretize the input variables. Discretization of continuous attributes plays an important role in knowledge discovery. Many data mining algorithms require training examples that contain only discrete values, and the rules generated by classification algorithms over discrete values are normally shorter and more understandable. Suitable discretization is useful for increasing the generalization and accuracy of discovered knowledge.
Discretization is the process of dividing the range of a continuous attribute into intervals. Every interval is labeled with a discrete value, and the original data are then mapped to these discrete values.
Discretization of continuous attributes is an important preprocessing step for data mining and machine learning algorithms. An effective discretization method not only reduces the demand on system memory and improves the efficiency of data mining and machine learning algorithms, but also makes the knowledge extracted from the discretized dataset more compact and easier to understand and use. Research shows that picking the best split points is an NP-complete problem. The result of discretization depends not only on the discretization algorithm itself but also on the data distribution and the number of split points. When the same discretization algorithm is applied to different datasets, we may get different results. We can only judge the effectiveness of a discretization method by the results of subsequent processing, so whether a discretization method is good or not also depends on the induction algorithm adopted afterwards.
There are many advantages of using discrete values over continuous ones: (1) discretization reduces the number of continuous feature values, which places smaller demands on system storage; (2) discrete features are closer to a knowledge-level representation than continuous ones; (3) data can be reduced and simplified through discretization, and for both users and experts discrete features are easier to understand, use, and explain; (4) discretization can make learning more accurate and faster; and (5) beyond these advantages, a suite of classification learning algorithms can only deal with discrete data. Successful discretization can therefore significantly extend the application range of many learning algorithms.

Categorization of Discretization Approaches.
Discretization algorithms can be categorized into supervised and unsupervised, based on whether class label information is used. Supervised discretization uses class information to guide the discretization process, while unsupervised discretization does not. Equal-width and equal-frequency are two representative unsupervised discretization algorithms. Many supervised discretization techniques have been proposed to date, of which Entropy-MDLP discretization has been accepted as by far the most effective in the context of both decision tree learning and rule induction algorithms. Compared to supervised discretization, previous research has indicated that unsupervised discretization algorithms have lower computational complexity but may result in much worse classification performance. When classification performance is the main concern, supervised discretization should be adopted.
Discretization methods can also be dynamic or static. A dynamic method discretizes continuous values while a classifier is being built, as in C4.5, whereas static discretization is done prior to the classification task.
Another dimension of discretization methods is local vs. global. A local method would discretize in a localized region of instance space (i.e., a subset of instances) while a global discretization method uses the entire instance space to discretize.
Discretization methods can also be grouped into top-down and bottom-up. Top-down methods start with an empty list of cut-points (or split-points) and keep adding new ones by splitting intervals as the discretization progresses. Bottom-up methods start with the complete list of all continuous values of the feature as cut-points and gradually remove some of them by merging intervals as the discretization progresses.
Another dichotomy is direct vs. incremental. Direct methods divide the range into k intervals simultaneously (e.g., equal-width, equal-frequency, or k-means), needing an additional input from the user to determine the number of intervals. Incremental methods begin with a simple discretization and follow it with an improvement or refinement process, which requires a stopping criterion to halt further discretization.
Discretization can be univariate or multivariate. Univariate discretization quantifies one feature at a time, while multivariate discretization considers multiple features simultaneously.

A Typical Discretization Process.
A typical (univariate) discretization process broadly consists of four steps: (1) sort the continuous values of the feature to be discretized; (2) evaluate a cut-point for splitting, or adjacent intervals for merging; (3) according to some criterion, split or merge the intervals of continuous values; and (4) finally, stop the discretization.

Sorting
The continuous values of a feature are sorted in either ascending or descending order. If sorting is done once and for all at the beginning of discretization, it is a global treatment and can be applied when the entire instance space is used for discretization. If sorting is done at each iteration of the process, it is a local treatment in which only a region of the instance space is considered for discretization.

Choosing a cutpoint
After sorting, the next step in the discretization process is to find the best cut-point to split a range of continuous values, or the best pair of adjacent intervals to merge. There are numerous evaluation functions, such as entropy measures and statistical measures.

Splitting / Merging
In the top-down approach intervals are split, while in the bottom-up approach intervals are merged. For splitting, cut-points are evaluated, the best one is chosen, and the range of continuous values is split into two partitions. Discretization continues with each part (the number of intervals increased by one) until a stopping criterion is satisfied. For merging, adjacent intervals are evaluated to find the best pair of intervals to merge in each iteration. Discretization continues with the reduced number of intervals (decreased by one) until the stopping criterion is satisfied.

Stopping Criteria

A stopping criterion specifies when to halt the discretization process. A stopping criterion can be very simple such as fixing the number of intervals at the beginning or a more complex one like evaluating a function.
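The four-step process described above can be sketched as a generic top-down loop. The sketch below is illustrative only: the function names, the scoring function, and the stopping rule are our own placeholders, not part of the paper.

```python
# A minimal top-down discretization skeleton illustrating the four steps:
# sort, evaluate candidate cut-points, split, and stop. The `score` and
# `should_stop` callables are illustrative placeholders.

def top_down_discretize(values, score, should_stop, max_intervals=10):
    """Return a sorted list of cut-points for a 1-D list of values."""
    values = sorted(values)                      # step 1: sort
    cut_points = []
    while len(cut_points) + 1 < max_intervals:
        # step 2: candidate cut-points are midpoints of adjacent distinct values
        candidates = [(a + b) / 2.0
                      for a, b in zip(values, values[1:])
                      if a != b and (a + b) / 2.0 not in cut_points]
        if not candidates:
            break
        best = max(candidates, key=score)        # step 2: pick the best one
        if should_stop(best, cut_points):        # step 4: stopping criterion
            break
        cut_points.append(best)                  # step 3: split the interval
        cut_points.sort()
    return cut_points

# Toy usage: score candidates by distance from the mean, stop at 2 cut-points.
data = [1.0, 1.2, 5.0, 5.1, 9.0]
mean = sum(data) / len(data)
cuts = top_down_discretize(
    data,
    score=lambda c: abs(c - mean),
    should_stop=lambda c, cuts: len(cuts) >= 2,
)
print(cuts)
```

A bottom-up method would invert steps 2 and 3: start with all midpoints as cut-points and repeatedly remove the one whose merge scores best.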


Literature Survey.
The past few decades have seen much research on discretization. In this paper we study a few unsupervised and supervised discretization methods. Equal-width and equal-frequency are two commonly used unsupervised discretization methods. Both require a parameter n, indicating the maximum number of intervals in discretizing a feature.

The equal-width [2] discretization technique determines the interval width from the user-specified number of intervals using the relation

iw = (max_value − min_value) / n

where
iw = interval width,
max_value = maximum value of the attribute,
min_value = minimum value of the attribute.

It then creates the cut points using the relation

cut_point = min_value + j × iw,  j = 1, 2, ..., n−1.

The equal-frequency [2] discretization technique is similar to equal-width, with the exception that the number of unique values (the frequency) within each of the user-specified n intervals should be equal. The interval frequency is obtained using the relation

if = nb_unique_values / n

where
if = interval frequency,
nb_unique_values = number of unique values of the continuous attribute.

The two methods are simple but are sensitive to n. For equal-frequency, for instance, many occurrences of a single continuous value could cause the occurrences to be assigned to different bins. This can be handled by adjusting the boundaries of neighboring bins so that duplicate values belong to one bin only. Another problem is the presence of outliers that take extreme values, which can be overcome by removing the outliers using a threshold.

Fayyad and Irani proposed a supervised discretization method, EntMDLP [3], which uses an entropy measure to find a potential cut-point to split a range of continuous values into two intervals. An entropy-based method uses the class information entropy of candidate partitions to select boundaries for discretization. Class information entropy is a measure of purity: it measures the amount of information needed to specify to which class an instance belongs. The entropy measure in the context of classification can be defined as

Ee = E1 + E2 = − p_left Σ(i=1..k) p(i,left) log p(i,left) − p_right Σ(i=1..k) p(i,right) log p(i,right)

where
Ee = entropy of the cut-point,
E1 = entropy to the left of the cut-point,
E2 = entropy to the right of the cut-point,
k = total number of classes,
i = a particular class,
p_left = number of instances to the left of the cut-point / total number of instances N,
p_right = number of instances to the right of the cut-point / total number of instances N,
p(i,left) = number of instances of class i to the left of the cut-point / number of instances to the left of the cut-point,
p(i,right) = number of instances of class i to the right of the cut-point / number of instances to the right of the cut-point.

EntMDLP considers one big interval containing all known values of a feature and then recursively partitions this interval into smaller subintervals until the stopping criterion is satisfied. The stopping criterion is based on the MDL (Minimum Description Length) principle and accepts a split only if

gain > log2(N − 1) / N + [log2(3^k − 2) − (k·E − k1·E1 − k2·E2)] / N

where

E = − Σ(i=1..k) p(i) log p(i),  p(i) = number of instances of class i / N,

gain = E − Ee = information gained by splitting at the cut-point,
N = total number of instances in the attribute value list at each recursion,
k1 = number of classes to the left of the cut-point,
k2 = number of classes to the right of the cut-point.
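The binning formulas and the class-information-entropy gain translate directly into code. The sketch below uses our own function and variable names and simplifies equal-frequency to splitting on unique values, as the paper describes; it is an illustration, not the reference implementation of any of the cited methods.

```python
import math
from collections import Counter

def equal_width_cuts(values, n):
    """cut_point_j = min_value + j * iw, with iw = (max - min) / n, j = 1..n-1."""
    lo, hi = min(values), max(values)
    iw = (hi - lo) / n
    return [lo + j * iw for j in range(1, n)]

def equal_freq_cuts(values, n):
    """Place roughly nb_unique_values / n unique values in each interval."""
    uniq = sorted(set(values))
    step = max(1, len(uniq) // n)
    return [uniq[j * step] for j in range(1, n) if j * step < len(uniq)]

def entropy(labels):
    """E = -sum_i p_i log2 p_i over a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(pairs, cut):
    """gain = E - Ee for one candidate cut on (value, label) pairs."""
    left = [y for x, y in pairs if x <= cut]
    right = [y for x, y in pairs if x > cut]
    n = len(pairs)
    ee = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy([y for _, y in pairs]) - ee

data = [(1.0, 'a'), (2.0, 'a'), (8.0, 'b'), (9.0, 'b')]
print(equal_width_cuts([x for x, _ in data], 4))  # [3.0, 5.0, 7.0]
print(round(info_gain(data, 5.0), 3))             # 1.0: the cut separates the classes perfectly
```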
A supervised, static, and global discretization method that uses the Gini gain [4] as its discretization measure was proposed by Xiao-Hang Zhang, Jun Wu, Ting-Jie Lu and Yuan Jiang. In this method the cut-point chosen on attribute A is the one whose Gini gain value is the largest. The Gini gain G is defined as

G(A, b; S) = Gini(S) − (|S1| / |S|) Gini(S1) − (|S2| / |S|) Gini(S2)

where S1 and S2 are the subsets of S partitioned by the cut-point b, |·| denotes the number of instances in a set, and Gini(·) is the Gini measure defined by

Gini(interval i) = 1 − Σ_j (p_j^(i))²

where p_j^(i) is the jth class probability in the ith interval and satisfies Σ_j p_j^(i) = 1.

The training set is split into two subsets by the cut-point chosen using the Gini measure. Subsequent cut-points are selected by recursively applying the same binary discretization to the generated subset that has the largest Gini gain value, until the stopping criterion is reached. The stopping criterion of the discretization algorithm is defined by

G(n+1) · ln(n + 1 + p) > G(n) · ln(n + p)

where n denotes the current number of intervals, p is a positive integer determined by the user, and G(n) is the Gini value with n intervals, defined by

G(n) = Σ(i=1..n) (|S_i| / |S|) Gini(interval i).

When compared with the EntMDLP algorithm, the results have shown that in many applications the Gini algorithm performs better than the EntMDLP algorithm and the original C4.5 algorithm, so it can be a good alternative to entropy-based discretization methods.

Lukasz A. Kurgan and Krzysztof J. Cios proposed the supervised CAIM (class-attribute interdependence maximization) discretization algorithm [5], which handles continuous and mixed-mode attributes. The CAIM algorithm's goal is to find the minimum number of discrete intervals while minimizing the loss of class-attribute interdependency; it uses class-attribute interdependency information as the criterion for the optimal discretization. The CAIM criterion measures the dependency between the class variable C and the discretization variable D for attribute F, for a given quanta matrix, and is defined as

CAIM(C, D | F) = ( Σ(r=1..n) max_r² / M_{+r} ) / n

where
n is the number of intervals,
r iterates through all intervals, i.e. r = 1, 2, ..., n,
max_r is the maximum value among all q_ir values (the maximum value within the rth column of the quanta matrix), i = 1, 2, ..., S, with S the number of classes,
M_{+r} is the total number of continuous values of attribute F that are within the interval (d_{r−1}, d_r].

The algorithm starts with a single interval that covers all possible values of a continuous attribute and divides it iteratively. From all possible division points that are tried, it chooses the division boundary that gives the highest value of the CAIM criterion. When the algorithm was tested on several well-known datasets and compared with six other state-of-the-art discretization algorithms, the comparison showed that CAIM generated discretization schemes with, on average, the lowest number of intervals and the highest dependence between class labels and discrete intervals, thus outperforming the other discretization algorithms. The execution time of CAIM is also much shorter than that of some other supervised discretization algorithms. The analysis of its performance shows that an algorithm that generates a small number of intervals helps to reduce the size of the data and improves the accuracy and the number of subsequently generated rules.
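To make the two criteria concrete, the sketch below (with our own naming conventions) computes the Gini gain of one binary split and the CAIM value of a discretization scheme whose quanta matrix is given as a list of per-interval class-count columns:

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_j p_j^2 over a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(pairs, cut):
    """G(A, b; S) = Gini(S) - |S1|/|S| Gini(S1) - |S2|/|S| Gini(S2)."""
    s1 = [y for x, y in pairs if x <= cut]
    s2 = [y for x, y in pairs if x > cut]
    n = len(pairs)
    return (gini([y for _, y in pairs])
            - len(s1) / n * gini(s1) - len(s2) / n * gini(s2))

def caim(quanta):
    """CAIM = (1/n) * sum_r max_r^2 / M_r, where each element of `quanta`
    is one interval's column of class counts from the quanta matrix."""
    n = len(quanta)
    return sum(max(col) ** 2 / sum(col) for col in quanta) / n

pairs = [(1.0, 'a'), (2.0, 'a'), (8.0, 'b'), (9.0, 'b')]
print(gini_gain(pairs, 5.0))   # 0.5: a perfect binary split of two classes
print(caim([[2, 0], [0, 2]]))  # 2.0: every interval is pure, CAIM is maximal
```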
The Khiops discretization method proposed by Marc Boulle [6] is a bottom-up method based on the global optimization of chi-square (χ²). χ² is a statistical measure that conducts a significance test on the relationship between the values of a feature and a class: the χ² statistic determines the similarity of adjacent intervals based on some significance level. It tests the hypothesis that two adjacent intervals of a feature are independent of the class. If they are independent they should be merged; otherwise they should remain separate. The formula for computing the χ² value is

χ² = Σ_i Σ(j=1..p) (A_ij − E_ij)² / E_ij

where
p = number of classes,
A_ij = number of examples in the ith interval belonging to the jth class,
R_i = number of examples in the ith interval = Σ(j=1..p) A_ij,
C_j = number of examples in the jth class = Σ_i A_ij,
N = total number of examples = Σ(j=1..p) C_j,
E_ij = expected frequency of A_ij = (R_i × C_j) / N.

The discretization method starts from the elementary single-value intervals and then searches for the best merge between adjacent intervals. Two different types of merges are encountered: first, merges with at least one interval that does not meet the minimum frequency constraint, and second, merges with both intervals fulfilling the constraint. The best merge candidate (with the highest chi-square value) is chosen in priority among the first type of merges (in which case the merge is accepted unconditionally), and otherwise, if all minimum frequency constraints are respected, among the second type of merges (in which case the merge is accepted under the condition of improvement of the confidence level). The algorithm is reiterated until all minimum frequency constraints are respected and no further merge can decrease the confidence level. When compared with other chi-square-based methods such as ChiMerge and ChiSplit, this global evaluation carries some intrinsic benefits. The Khiops automatic stopping rule brings both ease of use and high-quality discretizations, and its computational complexity is the same as that of the fastest other discretization methods.

Qiusha Zhu, Lin Lin, Mei-Ling Shyu and Shu-Ching Chen [7] proposed a novel and effective supervised discretization algorithm based on correlation maximization (CM), using multiple correspondence analysis (MCA). MCA is an effective technique for capturing the correlations between intervals/items and classes; the cut-point that gives the highest correlation with the classes is selected. The geometrical representation of MCA not only visualizes the correlation relationship between intervals/items and classes, but also presents an elegant way to decide the cut-points. The graphical representation of MCA, called a symmetric map, is used to visualize the intervals of a feature and the classes as points in a two-dimensional map. The correlation between an interval and a class can be represented by the cosine of the angle between these two vectors in the first two dimensions: the larger the cosine value of the angle, the stronger the correlation between them. For a numeric feature F_i, all values of the feature are sorted to form a set of n+1 distinct values. Candidate cut-points are the midpoints of all adjacent pairs in the set, and the cut-point with the largest cosine is selected as the first cut-point T1. The same strategy is then carried out separately in the left and right intervals in a binary recursive way; the recursion terminates if the correlation between the current intervals and the classes is lower than the correlation between their predecessor and the classes. When compared with other discretization algorithms, this algorithm produced a relatively small number of intervals and also has low computational complexity. Its drawback is that it cannot discretize datasets containing more than two classes.
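The χ² statistic above can be computed directly from the interval-by-class count table. The sketch below (our own naming; not Boulle's implementation) evaluates it for a pair of adjacent intervals, the case used when deciding whether to merge them:

```python
def chi_square(intervals):
    """chi2 = sum_ij (A_ij - E_ij)^2 / E_ij, with E_ij = R_i * C_j / N.
    `intervals` is a list of rows, one per interval; each row holds the
    per-class example counts A_ij for that interval."""
    num_classes = len(intervals[0])
    row_totals = [sum(row) for row in intervals]           # R_i
    col_totals = [sum(row[j] for row in intervals)         # C_j
                  for j in range(num_classes)]
    total = sum(row_totals)                                # N
    chi2 = 0.0
    for i, row in enumerate(intervals):
        for j, a_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / total
            if e_ij > 0:
                chi2 += (a_ij - e_ij) ** 2 / e_ij
    return chi2

# Two adjacent intervals with identical class distributions: chi2 = 0,
# interval and class look independent, so the pair is a merge candidate.
print(chi_square([[5, 5], [5, 5]]))    # 0.0
# Two pure intervals with opposite classes: high chi2, keep them separate.
print(chi_square([[10, 0], [0, 10]]))  # 20.0
```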

Conclusion and Future Scope.
Discretization of continuous features plays an important role in data preprocessing. This paper briefly introduced the problem of discretization and its many benefits, including improving the efficiency of learning algorithms and expanding their application scope. Over the past few decades much work has been done in this area, resulting in many different discretization methods. Choosing a suitable discretization method largely depends on the user's need for discretization, as well as on the kind of data to be discretized. While a lot of work has been done, many issues remain unsolved, and new methods are needed to address them. In future we expect robust discretization techniques which can overcome the drawbacks of handling huge data and large numbers of attributes.
References

[1] Salvador García, Julián Luengo, José A. Sáez, Victoria López, and Francisco Herrera, "A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning," IEEE Transactions on Knowledge and Data Engineering, Oct. 2011.

[2] A.K.C. Wong and D.K.Y. Chiu, "Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 6, pp. 796-805, Nov. 1987.

[3] U.M. Fayyad and K.B. Irani, "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," Proc. 13th Int'l Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.

[4] Xiao-Hang Zhang, Jun Wu, Ting-Jie Lu, and Yuan Jiang, "A Discretization Algorithm Based on Gini Criterion," Proc. Sixth Int'l Conf. Machine Learning and Cybernetics, Hong Kong, 19-22 Aug. 2007.

[5] Lukasz A. Kurgan and Krzysztof J. Cios, "CAIM Discretization Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2, Feb. 2004.

[6] Marc Boulle, "Khiops: A Statistical Discretization Method of Continuous Attributes," Machine Learning, vol. 55, pp. 53-69, 2004.

[7] Qiusha Zhu, Lin Lin, Mei-Ling Shyu, and Shu-Ching Chen, "Effective Supervised Discretization for Classification Based on Correlation Maximization," Proc. 2011 IEEE Int'l Conf. Information Reuse and Integration (IRI), 3-5 Aug. 2011.

[8] A. An and N. Cercone, "Discretization of Continuous Attributes for Learning Classification Rules," Proc. Third Pacific-Asia Conf. Methodologies for Knowledge Discovery and Data Mining, pp. 509-514, 1999.

[9] S. Kotsiantis and D. Kanellopoulos, "Discretization Techniques: A Recent Survey," GESTS Int'l Transactions on Computer Science and Engineering, 32(1):47-58, 2006.

[10] H. Liu, F. Hussain, C.L. Tan, and M. Dash, "Discretization: An Enabling Technique," Data Mining and Knowledge Discovery, vol. 6, no. 4, pp. 393-423, 2002.

[11] J.Y. Ching, A.K.C. Wong, and K.C.C. Chan, "Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 7, pp. 641-651, July 1995.

[12] W. Huang, "Discretization of Continuous Attributes for Inductive Machine Learning," master's thesis, Dept. Computer Science, Univ. of Toledo, Ohio, 1996.

[13] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and Unsupervised Discretization of Continuous Features," Proc. 12th Int'l Conf. Machine Learning, pp. 194-202, 1995.

[14] Y. Yang, G.I. Webb, and X. Wu, "Discretization Methods," in Data Mining and Knowledge Discovery Handbook, 2010, pp. 101-116.

[15] U. Fayyad and K. Irani, "Discretizing Continuous Attributes While Learning Bayesian Networks," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, pp. 157-165, 1996.

[16] I.H. Witten, E. Frank, and M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., Morgan Kaufmann, 2011.

[17] Khurram Shehzad, "EDISC: A Class-Tailored Discretization Technique for Rule-Based Classification," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 8, Aug. 2012.