 Open Access
 Authors : Devikala. D, Kamalraj. N
 Paper ID : IJERTV3IS100801
 Volume & Issue : Volume 03, Issue 10 (October 2014)
Published (First Online): 27-10-2014
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Enhanced Student Bunglers Detection using Association Rules and Predicting Outliers
Devikala. D
Research Scholar, Department of Computer Science,
Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore 641 049, India.
Kamalraj. N
Head of the Department, Department of Computer Technology,
Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore 641 049, India.
Abstract – Many countries have recently shown interest in, and concern about, the problem of student failure and in determining the main factors that affect student performance. A great deal of research is under way to identify the factors behind low student performance using the large amount of information stored in databases.
This paper proposes a novel classification approach that combines association rule mining with outlier detection. Data mining is applied after preprocessing the data and proceeds through association, classification, and outlier detection. The main objective of the paper is to detect dropout and failure cases as early as possible and to expose the contributing factors, so that the number of students who drop out or fail can be reduced. The outcomes are compared and the best result is identified.
Keywords: Educational data mining (EDM), Classification, Association, Outlier detection.

INTRODUCTION
Advances in information technology across disciplines such as database technology, scientific data, machine learning, neural networks, information retrieval, and statistics have led to the storage of large volumes of data in various formats, such as records, files, documents, images, sound, videos, and many newer data formats. Data mining is the process of identifying meaningful patterns and relationships within very large databases; it is also called knowledge discovery in databases (KDD). The steps carried out before data mining are data cleaning, data selection, preprocessing, and data transformation.
A great deal of research [1] has been done on identifying the factors that contribute to low student performance at different educational levels, using the large amount of information that current computers can store in databases. A current area of research in educational data mining is the development of methods for better understanding students and the settings in which they learn [2]. Existing work shows promising results on the economic, sociological, and educational characteristics that are most relevant to predicting low academic performance [8], but the classification-based algorithms used carry some cost in time and process complexity. This paper proposes the Apriori algorithm from association rule mining for classification, which provides more efficient results than the existing system. It reduces the complexity of the system, and extreme (abnormal) data are detected by an outlier detection method; a density-based outlier method is used for this purpose. The result produced by the system is more accurate, takes less time, and provides better performance.

LITERATURE REVIEW
C. Romero et al. [2] survey educational data mining and how the field has developed through the exploration of educational data. The paper introduces educational data mining, the different types of user groups, and the types of educational environments that provide the data. It lists the most common tasks in the educational environment that have been addressed with data mining techniques, and finally some promising features are discussed.
N. V. Chawla et al. [3] showed that combining over-sampling of the minority (abnormal) class with under-sampling of the majority (normal) class can achieve better classifier performance than varying the loss ratios of the classes.
S. Kotsiantis [4] reviews the methodologies that have been proposed for supporting students at risk of academic failure. The author proposes a local cost-sensitive technique and concludes that the framework is a more effective solution to the problem.
M. N. Quadri et al. [6] study the use of data mining to predict the dropout feature of students. They propose decision tree techniques for choosing the best prediction and for analyzing the features of failing students. The approach produces a list of students predicted as likely to drop out of college, which can then be handled by management and teachers.

METHOD
This paper proposes a method for predicting academic student failure that follows the knowledge discovery and data mining process. The stages of the method, as labelled in Figure 1, start with the collected data, the dataset, data preprocessing, and attribute selection; the remaining stages are described below.

Outlier Detection: Data objects in the database that do not follow the general behavior of normal data are called outliers. They are detected with an outlier detection method; this paper uses a density-based outlier detection method [10]. Alongside outlier detection, Figure 1 lists association rule mining and classification as data mining steps.

Interpretation: The obtained models are analyzed to detect the failing students in the database.


PROPOSED ALGORITHM
Classification based decision tree and rule induction
Imbalanced dataset classification with SMOTE
Apriori algorithm
Density based approach

Association rule mining
In educational data mining, association rule learning is a conventional and well-researched method for discovering interesting relations between attributes in large databases. Association rule mining is mainly intended to identify strong rules in databases using measures such as support and confidence. The preliminaries required for performing data mining on any data are discussed below.
Figure 1. Method proposed for prediction of student failure (flowchart, including a performance evaluation stage)

Data Gathering: All available information about the students is gathered; the set of factors that affect student performance is identified from it and fed into the dataset.

Preprocessing: In this stage, before a data mining algorithm is applied, the gathered data must be cleaned, transformed into variables, selected, and integrated properly, and any problems in the dataset must be resolved.

Classification based decision tree and rule induction: In this stage, a data mining algorithm is applied to predict student failure. White-box techniques are preferred because they generate easily interpretable models. Classification is performed with a decision tree, which is organized in a hierarchical structure. Rule induction is used to avoid many of the problems associated with decision trees.

Imbalanced dataset classification with SMOTE: SMOTE (Synthetic Minority Oversampling Technique) is used to solve the problem of imbalanced data classification, which occurs when the number of instances in one class is much smaller than the number of instances in another class. It is applied during data preprocessing to balance the class distribution.
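
The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the implementation used in the experiments: the function name `smote`, the parameter names, and the sample records (absences and average mark for failing students) are hypothetical, and only the core step of generating a synthetic point between a minority sample and one of its nearest minority neighbours is shown.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (itself excluded)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority class ("fail"): (absences, average mark)
fail_students = [(12.0, 4.5), (15.0, 3.8), (11.0, 5.0), (14.0, 4.1)]
new_points = smote(fail_students, n_new=6)
```

Because every synthetic record lies on a segment between two real minority records, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows.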

Association rule mining: Association rule mining is used to find relations between variables in a large database. Various algorithms exist for association rule mining; the Apriori algorithm is used in this system. It uses a breadth-first search to count the support of itemsets and a candidate generation function [9].
Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items, and let $D$, the task-relevant data, be a set of database transactions in which each transaction $T$ is a set of items such that $T \subseteq I$. Each transaction is associated with an identifier, called the transaction identifier (TID). Let $A$ be a set of items. A transaction $T$ is said to contain $A$ if and only if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. Rule interestingness is measured by support ($s$) and confidence ($c$), which reflect the usefulness and the certainty of a discovered rule, respectively. A support of 2% for the rule $A \Rightarrow B$ means that $A$ and $B$ occur together in 2% of all the transactions under analysis. A confidence of 60% for the rule $A \Rightarrow B$ in the transaction set $D$ means that 60% of the transactions in $D$ that contain $A$ also contain $B$. A set of items is referred to as an itemset, and an itemset that contains $k$ items is a $k$-itemset. The occurrence frequency of an itemset is the number of transactions that contain it. If the relative support of an itemset satisfies the minimum support threshold, it is a frequent itemset. Association rule mining follows a two-step process:

Discover all frequent itemsets: each of these itemsets occurs at least as frequently as a predetermined minimum support count.

Create strong association rules from the frequent itemsets: these rules must satisfy both minimum support and minimum confidence, and are called strong rules.
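
The two measures can be made concrete with a small sketch. The helper names `support` and `confidence` and the five student transactions are hypothetical and serve only to illustrate the definitions of support and confidence for a rule A ⇒ B.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of the transactions containing A that also contain B."""
    a = frozenset(antecedent)
    return support(a | frozenset(consequent), transactions) / support(a, transactions)


# Hypothetical student records encoded as item sets
D = [
    frozenset({"low_attendance", "no_homework", "fail"}),
    frozenset({"low_attendance", "fail"}),
    frozenset({"low_attendance", "pass"}),
    frozenset({"no_homework", "fail"}),
    frozenset({"pass"}),
]

s = support({"low_attendance", "fail"}, D)        # 2 of the 5 transactions
c = confidence({"low_attendance"}, {"fail"}, D)   # 2 of the 3 with low_attendance
```

With a minimum support of 30% and a minimum confidence of 60%, the rule low_attendance ⇒ fail would count as a strong rule on this toy data.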
3.2.2 Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant for mining frequent itemsets for Boolean association rules. The algorithm is based on prior knowledge of frequent itemset properties. The steps for generating frequent itemsets in the Apriori algorithm are as follows.
Let $C_k$ be the candidate itemsets of size $k$ and $L_k$ the frequent itemsets of size $k$. The iteration steps are:

Find the frequent set $L_{k-1}$.

Join step: $C_k$ is generated by joining $L_{k-1}$ with itself (Cartesian product $L_{k-1} \times L_{k-1}$).

Prune step (Apriori property): any $(k-1)$-itemset that is not frequent cannot be a subset of a frequent $k$-itemset, so every candidate in $C_k$ that contains such a subset is removed.

The frequent set $L_k$ is obtained by counting the support of the remaining candidates.
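
The join and prune steps can be sketched as follows. This is a minimal level-wise illustration under stated assumptions, not the implementation evaluated in this paper: the function name `apriori`, the use of an absolute minimum support count, and the toy transactions are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset: support count} using level-wise search."""

    def count(candidates):
        # Scan the transactions and keep candidates that meet the threshold
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support_count}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)               # L1
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)              # L(k-1)
        # Join step: unions of two (k-1)-itemsets that contain exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = count(candidates)      # Lk
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical student transactions
transactions = [
    {"low_attendance", "no_homework", "fail"},
    {"low_attendance", "no_homework"},
    {"low_attendance", "fail"},
    {"no_homework", "fail"},
    {"low_attendance", "no_homework", "fail"},
]
frequent_sets = apriori(transactions, min_support_count=3)
```

On this toy data every single item and every pair is frequent, while the three-item set appears in only two transactions and is therefore dropped, which ends the iteration.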
Figure 2 depicts the association rules discovered from the student grade data, with their support, confidence, and lift.
Figure 2. Association Rule Mining


Outlier detection
A database may contain data objects that do not comply with the general behavior of the data; these are called outliers. Analyzing outliers can help in fraud detection and in predicting abnormal values. The data stored in a database may contain outliers or noise, exceptional cases, or incomplete data objects. Incomplete data objects may confuse the analysis process and cause the constructed knowledge model to overfit the data, so the accuracy of the discovered patterns can be poor. Outlier analysis is applied to detect abnormal values in the students' result sheets. Such values may be due to factors such as negligence by the data entry operator, a software fault, or an extraordinary performance by the student in a particular subject.
Outlier detection discovers data points that are significantly different from the rest of the data. In educational data mining, outlier analysis can be used to detect students with learning problems. In this paper, outlier analysis is used to detect outliers in the student data, using a density-based approach. This approach computes the local densities of particular regions and declares instances in low-density regions to be potential outliers. The method used is the Local Outlier Factor (LOF). The basic idea of LOF is to compare the local density of a point with the densities of its neighbors; applying the method flags each record with an outlier score, and a larger score means a greater likelihood of being an outlier.
Intuition (density-based outlier detection): the density around an outlier object is significantly different from the density around its neighbors.
Method: The relative density of an object with respect to its neighbors is used as an indicator of the degree to which the object is an outlier.

$k$-distance of an object $o$, $dist_k(o)$: the distance between $o$ and its $k$-th nearest neighbor (NN).

$k$-distance neighborhood of $o$: $N_k(o) = \{o' \mid o' \in D,\ dist(o, o') \le dist_k(o)\}$

$N_k(o)$ may contain more than $k$ objects, since multiple objects may be at the same distance from $o$.

Reachability distance from $o'$ to $o$:
$$reachdist_k(o \leftarrow o') = \max\{dist_k(o),\ dist(o, o')\}$$
where $k$ is a user-specified parameter.

Local reachability density of $o$:
$$lrd_k(o) = \frac{\|N_k(o)\|}{\sum_{o' \in N_k(o)} reachdist_k(o' \leftarrow o)}$$

The LOF (local outlier factor) of an object $o$ is the average of the ratios of the local reachability densities of $o$'s $k$-nearest neighbors to that of $o$:
$$LOF_k(o) = \frac{\sum_{o' \in N_k(o)} \frac{lrd_k(o')}{lrd_k(o)}}{\|N_k(o)\|}$$

The lower the local reachability density of $o$ and the higher the local reachability densities of the $k$-NN of $o$, the higher the LOF.

This captures a local outlier, whose local density is relatively low compared with the local densities of its $k$-NN.
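
The LOF computation can be traced with a minimal sketch that uses the standard LOF definitions of k-distance, reachability distance, and local reachability density. The function name `lof_scores`, the Euclidean distance, and the five sample points are assumptions made for illustration; the sketch excludes each point from its own neighbourhood and assumes the data contain no duplicate points.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, using Euclidean distance."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k-distance and k-distance neighbourhood (ties may give more than k members)
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        k_dist.append(others[k - 1])
        neigh.append([j for j in range(n) if j != i and dist[i][j] <= k_dist[i]])

    def reach_dist(i, j):
        # Reachability distance from point i to its neighbour j:
        # max{k-distance of j, dist(i, j)}
        return max(k_dist[j], dist[i][j])

    # Local reachability density: neighbourhood size over summed reachability distances
    lrd = [len(neigh[i]) / sum(reach_dist(i, j) for j in neigh[i]) for i in range(n)]
    # LOF: average ratio of the neighbours' density to the point's own density
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)]


# Hypothetical records: four similar students and one far from the rest
records = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (8.0, 8.0)]
scores = lof_scores(records, k=2)
```

The four clustered records have the same density as their neighbours and receive a score of 1, while the isolated record receives a score well above 1 and would be flagged as a potential outlier.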

Figure 3 depicts outlier detection for students using the local outlier factor.
Figure 3. Outlier Detection


PERFORMANCE EVALUATION
The existing classification and prediction system and the proposed grammar-based genetic programming approach are tested on deriving the pass/fail result. Performance is measured in terms of the true positive rate (TPR), false positive rate (FPR), false negative rate (FNR), true negative rate (TNR), accuracy, and time.
We analyze and compare the performance of classification, classification with feature selection, imbalanced classification with the SMOTE oversampling technique, and prediction using the association rule mining and outlier detection approaches. Performance is evaluated using parameters such as accuracy. The comparison and the experimental results show that the proposed approach works better than the existing system.
Accuracy
Accuracy is calculated from the formula below:
$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{True negative} + \text{False positive} + \text{False negative}}$$

TP (True positive)
In a statistical hypothesis test, two types of incorrect conclusions can be drawn: the hypothesis can be inappropriately rejected or inappropriately accepted. A true positive, by contrast, is a positive test result that accurately reflects the condition being tested. If the outcome of a prediction is p and the actual value is also p, it is called a true positive (TP).
True positive rate (TPR) = TP / P, where P = TP + FN
Here P is the number of actual positives and TP is the number of true positives.

TN (True negative)
A true negative is a result that appears negative when it should. A true negative (TN) has occurred when both the prediction outcome and the actual value are n.
True negative rate (TNR) = TN / N, where N = TN + FP
Here N is the number of actual negatives and TN is the number of true negatives.

FP (False positive)
A false positive is a result that indicates that a given condition is present when it is not. If the prediction outcome is p but the actual value is n, it is said to be a false positive (FP).
False positive rate (FPR) = FP / (FP + TN)

FN (False negative)
A false negative (FN) occurs when the prediction outcome is n while the actual value is p.
False negative rate (FNR) = FN / (TP + FN)
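
The measures above can be computed from a list of actual and predicted labels, as in the following sketch. The function name `confusion_rates`, the choice of "fail" as the positive class, and the eight sample labels are hypothetical; the number of actual negatives is taken as TN + FP, so that TNR and FPR sum to one.

```python
def confusion_rates(actual, predicted, positive="fail"):
    """Accuracy and the four rates derived from the confusion matrix."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),   # actual positives: P = TP + FN
        "TNR": tn / (tn + fp),   # actual negatives: N = TN + FP
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
    }


# Hypothetical labels for eight students
actual = ["fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]
predicted = ["fail", "fail", "pass", "pass", "pass", "pass", "fail", "pass"]
rates = confusion_rates(actual, predicted)
```

Here TP = 2, FN = 1, FP = 1, and TN = 4, which gives an accuracy of 0.75, a TPR of 2/3, and a TNR of 0.8.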
Figure 4. Accuracy comparison graph (y-axis: accuracy rate (%), ticks from 80 to 100; x-axis: methods)
The graph shows the accuracy rate of the existing systems (classification, classification with feature selection, and imbalanced classification with the SMOTE oversampling technique) and of the proposed system (prediction using association rule mining and density-based outlier detection). The two axes are the accuracy and the method, existing or proposed. The graph shows that the accuracy of the existing systems is somewhat lower than that of the proposed system, so the proposed system gives the best accuracy of the methods compared.
VI. CONCLUSION AND FUTURE WORK
The aim of this system is to analyze the factors that affect the academic achievement of students. It is useful for identifying weak students who are likely to perform poorly in their studies. Classification is one of the most essential tasks in data mining and machine learning. An educational institution needs approximate prior knowledge of its enrolled students in order to predict their future academic performance. Various data mining techniques can be applied effectively to educational data. The results show that classification techniques can be applied to educational data to predict students' outcomes and to help improve their performance. The efficiency of various decision tree algorithms is analyzed based on their accuracy and the time taken to derive the tree. The predictions obtained from the system have helped the tutor to identify weak students and improve their performance. Classification accuracy and performance are higher in the proposed system than in the existing system, and the experimental results indicate that the proposed system is more efficient than the existing system.
As the next step in our research, more experiments can be carried out using more data and different educational levels, to test whether the same performance results are obtained with different DM approaches.
Future work aims to predict student failure as early as possible, to detect students at risk in time, before it is too late, and to propose actions for helping students identified as belonging to the risk group. The final step is to check how often it is possible to prevent the failure or dropout of a previously detected student.
REFERENCES

[1] F. Araque, C. Roldán, and A. Salguero, "Factors influencing university dropout rates," Comput. Educ., vol. 53, no. 3, pp. 563–574, 2009.

[2] C. Romero and S. Ventura, "Educational data mining: A review of the state of the art," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 6, pp. 601–618, Nov. 2010.

[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002.

[4] S. Kotsiantis, "Educational data mining: A case study for predicting dropout-prone students," Int. J. Know. Eng. Soft Data Paradigms, vol. 1, no. 2, pp. 101–111, 2009.

[5] C. Márquez-Vera, C. Romero Morales, and S. Ventura Soto, "Predicting School Failure and Dropout by Using Data Mining Techniques," IEEE Journal of Latin-American Learning Technologies, vol. 8, no. 1, Feb. 2013.

[6] M. N. Quadri and N. V. Kalyankar, "Drop Out Feature of Student Data for Academic Performance Using Decision Tree Techniques," Global Journal of Computer Science and Technology, vol. 10, no. 2, 2010.

[7] D. Devikala and N. Kamalraj, "Data Mining Approaches on Detection of Students Academic Failure and Dropout: A Brief Survey," International Journal of Computer Trends and Technology (IJCTT), vol. 14, no. 3, Aug. 2014.

[8] J. Más-Estellés, R. Alcover-Arándiga, A. Dapena-Janeiro, A. Valderruten-Vidal, R. Satorre-Cuerda, F. Llopis-Pascual, T. Rojo-Guillén, R. Mayo-Gual, M. Bermejo-Llopis, J. Gutiérrez-Serrano, J. García-Almiñana, E. Tovar-Caro, and E. Menasalvas-Ruiz, "Rendimiento académico de los estudios de informática en algunos centros españoles," in Proc. 15th Jornadas Enseñanza Univ. Inf., Barcelona, Rep. Conf., 2009, pp. 5–12.

[9] http://en.wikipedia.org/wiki/Association_rule_learning#Apriori_algorithm

[10] http://en.wikipedia.org/wiki/Local_outlier_factor