 Open Access
 Total Downloads : 5
 Authors : P. Sravanthi, K. Hema
 Paper ID : IJERTCONV3IS18008
 Volume & Issue : NCACI – 2015 (Volume 3 – Issue 18)
 Published (First Online): 24042018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Extended RTree Indexing Structure for Ensemble Stream Data Classification
P. Sravanthi

ech Student, Department of CSE KMM Institute of Technology and Sciences
Tirupati, India
K. Hema
Assistant Professor, Department of CSE KMM Institute of Technology and Sciences Tirupati, India
Abstract Ensemble learning is one of the most important techniques in machine learning. Ensemble based solutions are more accurate to the problem of large scale data classification. Ensemble learning method handles both large volumes of stream data and concept drifting feature. All the ensemble methods to date are mainly concentrated in constructing more accurate ensemble models only but not on prediction of the class label of the incoming record. For a large set of ensemble classifiers, linear scan of all classifiers in the ensemble is a time taking process and it may not be suitable for many real life applications in prediction. Finding highly rated websites, business services, document flows, trends in share marketing etc are potential candidate applications for ensemble learning.
Currently the Linear time complexity of dynamic data stream classification/prediction using ensemble learning technique is costly in terms of response time. Ensemble learning techniques with linear time complexity are not suitable for many real world and time critical dynamic data stream applications such as spam detection, web traffic stream monitoring and intrusion detection. Newly proposed Extended Ensemble tree (EEtree) indexing structure technique reduces the time complexity from linear to sub linear in predicting a randomly selected stream record. E trees represent all the ensembles in a spatial database. Etree are height balanced tree structures with automatically updated features in addition to the continuously inserting new classifiers and deleting old classifiers and selfadapting to new trends, features, patterns in connection of changes in the dynamic data streams. Experimental results show that the performance of the proposed approach with sublinear time complexity is superior than the current approach with linear time complexity O(n).
KeywordsEEtree, spatial data, ensemble method, stream data classification

INTRODUCTION
Classification is an important technique in data mining. Dynamic data stream classification is also an important technique in dynamic data stream mining. Stream data mining has been popularly used in realtime intrusion detection, spam filtering, and malicious website monitoring [1]. In the applications, data arrive continuously in a stream fashion, timely predictions in identifying malicious records are of essential importance [1]. Dynamic data stream mining is used in many real life and time critical application such as intrusion detection, spam filtering and web traffic stream monitoring. In such applications data arrives continuously in a stream fashion. Predicting the
class label of desired stream record or identifying outlinear stream records are very important dynamic data stream functions.
Dynamic data stream classification approach suffers with many problems such as continuously fast increasing of dynamic data stream volumes, concept drift (the changes in the distributions of stream data) features, patterns, properties in dynamic stream data. To overcome many of the existing problems in dynamic data stream classification, many ensemble methods such as average classifiers ensembles, incremental classifier ensembles, combined classifier and cluster ensembles, optimal ensembles, have been proposed.
To date, existing works on ensemble learning in data streams mainly focus on building accurate ensemble models. Prediction efficiency has not been concerned mainly because (1) prediction typically takes linear time, which is sufficient for general applications with undemanding prediction efficiency. (2) Existing works only consider combining a small number of base classifiers, e.g., no more than 30 [1].
Ensemble learning design technique is popular. Many stateoftheart ensemble learning techniques in data stream classification consider only the construction of accurate ensemble learning models with least priority given to prediction efficiently. For many real world applications, linear time complexities of existing ensemble learning models are sufficient because there is no demand for prediction efficiency. Existing ensemble learning techniques generally combine only a limited (finite) number of base classifiers. But in many real life applications, a large number of ensemble base classifiers are needed inorder to capture many desired patterns of dynamic stream data. In other words, we need ensemble learning models with sublinear (logarithmic) prediction time complexity. Generally, the linear relationship between prediction time and ensemble size is not at all suitable for many real world applications.
Sublinear (logarithmic) time complexity of prediction is possible by considering only shared patterns and features among all base classifiers. Decision trees are reliable ensemble classifiers and each base classifier is composed of a batch of decision rules. Each decision rule is represented as a spatial object in the decision space. Entire
ensemble model is represented as a spatial database. Popular indexing structures for spatial data are Rtree, R* tree, R+tree and Mtree. These type of trees have indexing structures with richer information – classifier weights, probability distributions and class labels can also be used as ensemble models are mainly used for fast prediction of incoming dynamic stream data records. In ensemble models a decision region is associated with different class labels.
Dynamic webpage stream monitoring through online is one best example for ensemble learning techniques that use divide and conquer technique to manage huge volumes of dynamic stream data with concept drifting. Ensemble learning models are scalable, low error rate, parallelizable, quickly adoptable to new features. Ensemble technique divides the stream data into partitions and builds one or more base classifiers for each partition and these base classifiers are combined for predicting the unknown stream record. The working principle of all the ensemble learning models is stated as the principle design that uses divide and conquer techniques to handle large volumes of stream data with concept drift or concept change.
Paper is organized as follows. In section II actual problem of ensemble stream data classification is defined. In section III ensemble Etree structure is given. In section IV algorithms are given. In section V stream data classification methods are given. In section VI conclusions are given.

PROBLEM DEFINITION
Dynamic stream records are represented as S={(xi, yi)}, where xi represents a set of k attributes and yi is a class label with two valuesyes or no. For simplicity a two class problem is considered. Stream data is divided into n partitions and n decision tree base classifiers c1, c2, c3.cn are constructed all n base classifiers are combined into a single ensemble classifier E. Each base classifier ci consists of x decision rules represented by if then clauses. Total number of decision rules are n*x = nx. All these nx decision rules are represented in spatial database as nx spatial objects. Main goal is to construct best ensemble model to predict incoming stream record accurately within logarithm (sublinear) time complexity O(log n). To achieve this goal each base decision tree classifier Ci is converted into a batch of spatial objects (SOs). This batch f spatial objects is called spatial database, SD. Entire ensemble model E is converted to a spatial database (SD) containing all spatial objects. With this idea given original problem is changed to classifying each incoming dynamic stream record r by systematically searching over the spatial database (SD).
EXAMPLE 1
Generally maps, graphs, cities, universities, and places are represented in Rtree indexing structure. In this paper ensemble decision tree classifiers are represented as a
spatial database. Each decision tree classifier is represented as a set of rules. Each decision rule is represented as a spatial object in the spatial database.
Assume S is a web monitoring dynamic stream with two classes and each stream record has two attributes A1 and B1. Values of attribute A1 are a1, a2, a3am and values of attribute B1 are b1, b2, b3bm where 0 <= ai<= 6 and 0<= bi<= 6. For simplicity only latest six classifiers c1, c2, c3, c4, c5 and c6 are considered. An ensemble E is created based on these six classifiers. Generally each base classifier represents a set of ifthen rules. Again for simplicity each base classifier is represented using only one ifthen condition. Ensemble E is used to predict the class label of an incoming dynamic stream record r. More generally a classifier is represented by anding/oring many ifthen expressions.
An ensemble model E is converted into a spatial database model (Spatial database) and the mapping procedure of converting ensemble model into Spatialdatabase model is shown in the fig.3. Entire spatial database is represented by a twodimensional rectangle A = (0, 0, 6, 6). Each base classifier ci is represented by a small black rectangle. The (i+1)th black rectangle drifts to the right side of the ith rectangle from bottom to top of the decision are of spatial database because of concept drifting. The complete ensemble classification model is represented by using a set of spatial objects, black rectangles. Collection of spatial objects (black rectangles) represents the spatial database (Spatial database). Ensemble stream model is such that for a given small circle x = (5, 5) as an incoming record, we would like to classify x as both accurate and fast in logarithmic time complexity, O(log n).

ENSEMBLE TREE INDEXING STRUCTURE
Ensembletree (Etree) data structure is a multiway height balanced indexing Rtree like structure. It is an extended R tree structure. Most important operations of Etree are insertion, search, deletion and all these operations are performed similar to Rtree operations.
A. Structure of ETrees
Etree contains two main components:

Rtree like indexing structure that stores all decision rules of the ensemble.

A twodimensional table structure that stores base classifier details such as IDs and weights of base classifiers.
Two components of Etree are directly linked to each other such that base classifier in the table is linked to its corresponding decision rules in the Rtree. Etree is height balanced multiway indexing tree structure. Etree stores all decision rules of the ensemble. An ensemble is a collection of rules from a group of classifiers. Each classifier is represented as a set of decision rules.
Etree creates and maintains two different types of nodes.
1. Pivot node 2. Leaf node
Each leaf node stores a set of decision rules represented by a heavily overlapping area in the spatial decision space. All decision rules are converted into spatial objects and each spatial object is represented in the form (Rectangle bounded by a decision rule, classifierid, and memory address of next decision rule). Classifierid represents the identification of the particular classifier that has been generated the decision rule.
(minimumrectangle, classifierid, sibling)
The entries of the pivot node in the Etree are represented as
ID
Decision rules
C1
If(0<=r1<=1)and (0<=r2<=1) then correct ; otherwise incorrect
C2
If(0.5<=r1<=2)and (0.5<=r2<=2) then correct ; otherwise incorrect
C3
If(1.5<=r1<=3)and (1.5<=r2<=3) then correct ; otherwise incorrect
C4
If(3<=r1<=4.5)and (3<=r2<=4.5) then correct ; otherwise incorrect
C5
If(4<=r1<=5.5)and (4<=r2<=5.5) then correct ; otherwise incorrect
C6
If(5<=r1<=6) and (5<=r2<=6) then correct ; otherwise incorrect
TABLE 1. Six base classifiers
(minimumrectangle, childpointer)
Where minimumrectangle is the smallest rectangle that covers all decision rules in its child nodes and childpointer is a pointer (reference) pointing to its child nodes. A non root node should contain entries between m and M as in the case of B+tree, where M is the maximum number of entries of the nonroot node and m is the minimum number of entries of the nonroot node. The root should contain at least two entries.
For simplicity we have taken only two dimensional spaces. In general, in ddimensional spatial space a decision rule covers a closed space = (0, 1, 2,n) where each i represents a closed bounded interval along ith dimension. The TABLE structure is called Etable that stores all the base classifier details of the selected ensemble, E. Each entry of the Etable is represented as
(Classifieridentification, Classifierweight, Pointer)
Where classifier identification represents identification of a particular classifier in the ensemble, E, classifierweight represents the weight of the classifier, and pointer represents or stores address or reference of the first component decision rule of the classifier.
Here the following TABLE 1 stores six classifier rules. Rtrees are popularly used as indexing structure to manage conventional spatial objects such as graphics objects, vehicle objects, maps, images, location objects, spatial objects, multimedia objects, mechanical objects, stars, satellites etc. Etree indexing structures are particularly used to manage a new and different type of spatial data objects, decision rules belonging to the ensemble, E
Decision rule is converted to Rectangle
Base classifier is converted to spatial objects
Ensemble is converted to spatial database
Decision rule is converted to Rectangle
Base classifier is converted to spatial objects
Ensemble is converted to spatial database
TABLE 2 Mapping an ensemble model to a spatial database

Etree stores decision rules that are constructed in a continuous attribute space and discreet attributes are converted into continuous attributes by applying suitable changes.

Each decision rule represented in the Etree covers a closed spatial space. All partially closed spatial spaces must be converted into fully closed spatial spaces by changing lower or upper bounds appropriately. For example, consider the decision rule of the classifier C1. This decision rule, (r1<=1.5) and (r2<=1.5), is half closed and it is converted into fully closed spatial space as (0<r1<=1.5) and (0<=r2<=1.5). In general, if a decision rule covers only a few dimensions (i) then it will be converted into remaining (di) dimensions also. Suppose, assume a decision rule that is defined in a partially closed spatial space R = (0<=r1<=1.2) then R must be changed to a closed space as
R = (0<=r1<=1.2) and (0<=r2<=6)

All decision rules of the ensemble E are taken as hard decision rules only but not fuzzy or other rules. For example, an image i correctly represented or not. Another example is, an image is correctly represented for the country or not.

For simplicity purpose, only binary classification model is considered same procedure can be used to extend this E tree indexing technique for representing multiclass problems in spatial space. In binary classification a simple technique of representing Etree indexing decision rules is that stores only the decision rule of a minor class containing small set of decision rules.


ALGORITHMS

ETree Insertion Algorithm
Algorithm 1: Inserting a node into Etree
Input :

E tree T

Classifier C (one or more rules)

m , minimum number of entries in a node

M, maximum number of entries in a node
Fig. 3 Inserting decision rules into the ETree
Output : updated or modified Etree T1

PT .tree.root // get root of Etree

Foreach decision rule RC do

LsearchLeaf(R, P)

If( L.size < m ) then 5.


UpdateParentnode (L)

Else

<PL,PR> splitNode(L, R)

adjustTree(T, PL, PR)

Endif

Endfor

Insert the classifier, C into table structure

Output the tree, , after inserting the new classifier, C
Explanation of ETree insertion operation
Whenever new trends and patterns are found during dynamic stream data classification, the Etree insertion algorithm insert new base classifiers into the ensemble model and this model is automatically converted into the spatial space data object and inserted into the Etree indexing structure. Whenever a new classifier, C, is created, a new entry associated with the newly created classifier is inserted into the Etable structure and similarly all the decision rules, R, belonging to the newly created classifier are inserted into the Etree structure one after the other, and linked all the rules together by the respective pointer entries.
Inserting decision rules of a new classifier into Etree is very similar to the insertion operation of Rtree indexing structure starting from the root search proceeds downwards in order to find a leaf node that covers the rules of the new classifier. Once a leaf node is found, it is checked to find whether space for insertion is available or not. If the leaf node contains less than the maximum number of entries, M, then new set of rules are inserted into the leaf node. Parent node is updated appropriately. After searching, if it is found that there is no space for new insertion, then the current leaf node is split into two nodes and then new set of rules are inserted. Node splitting in Etree is a difficult step.
Search operation of Etree
A search operation is initiated to find the class label of a newly incoming stream record, x. Search algorithm of E tree first starts from the root to downwards of the Etree and finds all the decision rules in the leaf node or leaf nodes covering the newly arrived dynamic stream record x. Class label of the newly arrived stream record is calculated by combining all the decision rules. After the completion of the search operation class label of the newly arrived dynamic stream record is calculated using the following formula
= (=1 ( ), ),———— (4)
where yx is the class label of newly arrived dynamic stream record, u is the total number of retrieved decision rules covering new record x, sgn(temp, ) is a threshold function that decides class label of the new record x, by comparing temp and , and Cindex is the indexing class (minor class) in the Etree. Etrees are particularly useful when the minor class that has fewer decision rules is indexed in Etrees. In the given example1, Cindex represents the target class (abnormal class). P(Cindex/x) is a hard posterior probability of either 0 or 1. Hard means the value is taken as either 100% correct or 100% incorrect.
The search algorithm of Etree performs a depthfirst search (DFS) technique for searching. Inorder to find class label of a newly arriving dynamic stream record x, the E tree search algorithm first traverses along the Etree branches whose rectangles cover x and then class label of x computed by using the equation 4. Classifierid and Wait of each classifier is given in the table. Search algorithm finds class label of the incoming stream record by using a technique known as overlapped patterns or shared patterns. More number of ensemble classifiers is required in order to reflect all the patterns, features, modifications, and other dynamic changes in the stream data classification. All the algorithms must be fast enough to maintain and update all the details of ensemble management.


ETree Search Algorithm
Algorithm 2: E_Tree Search
Input:

Etree T

Stream record, x

Parameter
Output: Class label of x (Here yx is class label of x)
1. Initialize(stack) // initialize a stack U
// U stores all records covering

While (stack ) do

e pop(stack)

if (e is an entry of a leaf ) then 9.

else

P e.child

foreach entry e P do

If ( x e) then

push (stack, e)

endif

endfor

endif

endwhile

foreach entry e U do

Find its weights in the table structure

Endfor
=
=
22. = ( 1 ( ), )
23. Output yx



ETree Deletion Algorithm
ETree deletion operation
We assume that there is a prespecified maximum capacity of classifiers in the Etree. Deletion operation in Etree deletes all the outdated and unuseful classifiers after the maximum capacity of Etree classifiers is reached. Assume that current maximum capacity ensembles in the Etree is 20 and c1, c2, c3..c20 are in Etree. Whenever maximum capacity of Etree is reached after the insertion of only new classifier, the old and unuseful classifiers are deleted automatically by Etree deletion algorithm. That is when C21 enters, and then C1 will be deleted.
The important deletion methods in Etree are:

Etree deletion, which is similar to the deletion in Btree.

Etree deletion which is similar to the deletion in Rtree
In the first method underfull nodes are merged into one of the siblings such that result will in the least area increase. In the second deletetheninsert operation is applied. Underfull node is deleted first and then remaining entries are inserted into the Etree using the Etree insertion operation. The second method is very easy to implement and it is also very advantageous. Also it is very to refine the existing spatial structure of the Etree after reinsertion. First Etree deletion method which is similar to Btree is inefficient and it results many node splits. Second method is the best method for deleting a node in Etree.
x

PT.tree.root // obtain the root of tree, T

Foreach entry R T do

Push(stack, R)

Endfor
Algorithm 3 ETree Deletion
Input:

Etree, T

Classifier, C

Parameters, m, M
Output:
updated Etree

P T.tree.root // obtain the root of the tree

A T.Table.ref // get the reference from the table

R SearchClassifier(C,A)

P R.pointer

While (P )

L P.node()

Q P.sibling

deleteEntry (L, P)

If (.size < m) then

deleteNode(T, )

foreach entry e do

insertRule(e,T)

endfor

endif

Pq

endwhile

Output


Proposed Algorithm 4 for ETree Deletion
A new method is proposed in the ETree deletion operation to delete a selected classifier and its associated decision rules from the ensemble .We use a specific procedure. In the normal Etree a classifier is deleted by first in first out principle. At any time Etree ensemble size is constant. Whenever there is need to delete a classifier from the ensemble we will delete the oldest classifier first.
We propose a new method for deleting a classifier from the ETree ensemble. Fist we will calculate accuracy of each classifier, Ci, in the ETree ensemble on the incoming stream test data, and then delete a classifier whose accuracy is smallest. In other words, test each classifier on the test stream data and then discard the classifier whose misclassification error is maximum. Also there are other methods for deleting a classifier from the ensemble. Some examples are false positive rate, false negative rate, mean, moving average, weighted moving average, some sampling techniques, and other methods that can detect concept drift or concept change.
Algorithm 4 Proposed Algorithm for ETree Deletion
Input:

A set of base classifiers constructed and represented

in
the ETree. 2.test stream data
Output:

a set of base classifiers after deleting the least important base
classifier.

get test data from the incoming input stream data.

classifier_id 1.

newaccurancy 0.

foreach base classifier, from the ensemble, E do

accurancyi compute accuracy of classifier Ci using selected test stream data.

if(accurancyi > newaccurancy) then

newaccurancy accurancyi

classifierid i

endif.

end for.

E_Tree_deletion( ETree T, classifier Ci )
A decision classifier C4.5 is used to generate decision rules from the dynamic stream data. All decision rules are hard decision rules only. That is decision rules are not fuzzy. Various measures used for online query evaluation are:

Time cost computational cost of Etrees is much lower than the computational cost of traditional ensemble models because Etrees are height balanced indexing tree structures that are used to index all classifiers in the ensemble.

Memory cost Through Etrees consumes larger memory during stream data record classification, the memory size is in affordable range only.

Accuracy prediction accuracy of Etrees is high and it is equal to the accuracy of original ensemble models.
ETree ensemble learning
Architecture of ensemble models on dynamic data streams using Etree indexing structure is shown in the fig 4. Training module maintains and controls all the insertion and deletion operations of Etrees. Training module is responsible to monitor Etree operations for constant updating of Etree. Prediction module is responsible to predict the class label of newly arriving dynamic stream record x by using synchronized copy of Etree received from training module. Etree search operation is applied to make online predictions. We assume that stream is coming with unlabeled data. Initially the buffer in the training module is filled with incoming stream records. Records in the buffer are labeled by human experts or intelligent labeling machines. This labeling process is very slow, costly and time consuming and only a small percentage of
incoming records at regular intervals regularly in order to provide uniform labeling. As soon as buffer is full, automatically a new classifier is constructed and then it is inserted in the ensemble. New classifiers of Etree are constructed regularly. Once the maximum capacity of the Etree is reached automatically old or outdated or unuseful classifiers are deleted from the Etree by executing Etree deletion operation. Updated and latest Etree will be synchronized and Etree copy is passed to the prediction module.
B. Ensemble learning models
Ensemble learning model is average or weighted or group based approach that uses a divideandconquer strategy. Decision tree uses nonparametric approach and ensemble model predominantly uses decision tree classifiers. Ensemble learning model in the first step splits continuous dynamic stream data into small groups of tuples and then constructs lightweight base classifiers (decision trees) for these small groups. Ensemble E is constructed by combining all these light weight base classifiers in a systematic way. Ensemble learning models scale well to
Unlabeled stream Prediction module
ERtree
Labeled stream
Synchronization
very large volumes of dynamic stream data, very easy to parallelized this model, it adapts very quickly to new and dynamic changes, patterns, graphics, and it is susceptible to
Buffer
Training Insertion
ERtree
lower variance errors, and very useful for spatial data maintenance. Ensemble stream data model also reasonably good in managing concept drift features.
Classifier (decision rules)
Classifier (decision rules)
If not full
Deletion if null
Labeling experts
Training module
Fig 4. Architecture of ensemble learning with ERtree


STREAM DATA CLASSIFICATION METHODS
Dynamic data stream classification methods are roughly divided into two categories (groups):

Online or incremental modes and

Ensemble learning modes
A. Online incremental models
Main goal of online incremental models is to build a single robust, efficient, effective and sophisticated model that can be continuously updated in order to reflect the latest changes in trends and patterns of dynamic stream data. Very fast decision tree(VFDT) model and the incremental SVM model are best examples for online incremental models.


CONCLUSIONS

In this paper we proposed a new algorithm for E Tree deletion. Previously time complexity of n base classifiers in the ensemble of stream data mining is O(n). Multiway ETrees are developed to decrease the time complexity from linear, O(n), to sublinear or logarithmic, O(log n). When classifiers are deleted in first in first out fashion, some of the classifiers with reasonably good accuracy might have deleted. Hence, we propose a new deletion algorithm in ensemble tree (Etree deletion) deletion. New Etree deletion algorithm deletes only the classifier which gives less accurate results on the new test data selected randomly from the dynamic stream data.
REFERENCES

Peng Zhang, Chuan Zhou, Peng Wang, Byron J. Gao, Xingquan Zhu, and Li Guo IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 2, FEBRUARY 2015

J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, third ed. Morgan Kaufmann, 2011.

J. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.