 Authors : Syed Junaid Rahmathulla
 Paper ID : IJERTCONV4IS27008
 Volume & Issue : NCRIT – 2016 (Volume 4 – Issue 27)
 Published (First Online): 24-04-2018
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Combined Clustering and Classification Approach for Predicting Placement Chance
Syed Junaid Rahmathulla
Department of MCA
Global Institute of Management Sciences Bangalore
Abstract: Educational data mining (EDM) is an emerging practice that applies data mining tools and techniques to educational data. The practice focuses on extracting and analyzing educational data to develop models for improving learning experiences and institutional effectiveness. If this technology is used for the benefit of the common man, its purpose is served. The purpose of this paper is to help prospective management students choose the right specialization in the Master of Business Administration (MBA), namely Finance, Marketing, HR, Operations, etc., based on their entrance exam ranking for admission to the PG course. In this paper, a combined clustering and classification approach is applied. Two clustering algorithms, K-means and Support Vector Clustering, and one classification algorithm, Naïve Bayes, are applied to the same data set. These algorithms are implemented to predict which of the courses offered gives the better placement chance. The student enters Rank, Gender, Category and Sector, and the model answers in terms of Excellent [E], Good [G], Average [A] and Poor [P] for the data entered. Each course offered is associated with one of these answers, for example Operations E, Finance G, Marketing A, HR

The algorithms applied are compared in terms of precision, accuracy and true positive rate. The results show that the clustering algorithm K-means predicts better than the other algorithms. This work will help students choose the course best suited to them, the one that offers the best placement chances for the data entered.
Keywords: Data mining, Naive Bayes, Confusion matrix, Support Vector Clustering, K-means, Prediction and modeling.

INTRODUCTION
Data mining consists of a group of techniques for mining data, such as association rule mining, classification and clustering. In this model, two algorithms are selected from the clustering models and one from the classification models. Management is a profession with bright scope. The Master of Business Administration (MBA) is a master's degree, and the duration of the course is two years. Students acquire both a working knowledge of management functions and the analytical skills needed to practice them. Selecting the right specialization at the time of post-graduation therefore plays an important role in a student's career. The specialization is selected by accessing and analyzing previous years' management college data, i.e., entrance exam ranks. The main aim is to find hidden patterns and characteristics in the relationships among students, which helps to predict the future scope of a specialization for the student. This requires processing and comparing large amounts of data. Classification techniques classify the data into predefined class labels, and clustering techniques group data that have similar relationships. The data collected from management colleges was in Excel format; it was fed to MySQL in the form of queries, and two databases were constructed, one containing the historical data and the other the test data.

PROBLEM STATEMENT
Every student dreams of being successful in life. To be successful, a student must select a good specialization after graduation, so choosing the right courses while studying is important. Hence a prediction model is proposed that helps students choose a course based on the data or information they furnish. Among the fields or attributes that the student enters, those that contribute to the result are selected. Various mining algorithms from different models are applied to the processed data and tested accordingly. The algorithms are compared on criteria such as accuracy, precision and true positive rate.

RELATED WORKS

Many researchers have explored the best mining techniques for solving placement chance prediction problems. A few of the related works are listed below.

Seral Şahan and Kemal Polat [1] explain how fuzzy sets are introduced into the k-NN technique to develop a fuzzy version of the algorithm. Ying Huang and Yanda Li [2] proposed a fuzzy k-NN method based on the dipeptide composition of proteins for predicting subcellular locations. An advantage of the method is that it incorporates sequence-order effects into the prediction. It was applied to a new data set derived from version 41.0 of the SWISS-PROT databank, and high predictive accuracy was achieved in a jackknife test. Keller, J.M. and Givens, J.A. [3] note that the fuzzy nearest-neighbor classifier, while it may not produce error rates as low as k-nearest neighbor, is still attractive and desirable. Kai Ming Ting and Zijian Zheng [4] work on the naive Bayes algorithm to increase its instability, expecting this to increase the success of boosting. Yongchuan Tang and Yang Xu [5] present a method to identify a fuzzy model from data using fuzzy Naive Bayes and a real-valued genetic algorithm. Identifying the fuzzy model comprises extracting if-then rules and then estimating their parameters.

Hongjun Lu, Rudy Setiono and Huan Liu [6] describe constructing and training a network to classify the tuples in a given training data set to the required accuracy, pruning the network while maintaining the classification accuracy, and extracting symbolic rules from the pruned network. Mark W. Craven and Jude W. Shavlik [7] argue that neural network methods deserve a place in the toolbox of the data miner. Markus Brameier and Wolfgang Banzhaf [8] present an efficient algorithm for detecting non-effective instructions in linear genetic programs; eliminating these introns before fitness evaluation results in a significant decrease in runtime. Sudheep Elayidom, Suman Mary Idikkula and Joseph Alexander [9] showed that data mining can be applied very effectively to employment prediction, helping students decide on a branch that may fetch them a placement; they also proposed a generalized structure for similar problems. Ajay Kumar Pal and Saurabh Pal [10] propose a classification-based model to find an enhanced evaluation method for predicting student placement. This model can determine the relations between students' academic achievement and their placement in campus selection. A. K. Pal and S. Pal [11] study frequently used classifiers and conduct experiments to find the best classifier for predicting students' performance.


DATA MINING ALGORITHMS APPLIED

Brief Description of the K-means algorithm
K-means: K-means is a clustering algorithm based on the concept of partitional clustering. It completely traverses each column of the database, headed by a category, and clusters it as a group. While traversing a column, N objects with attributes are identified. Among the objects identified, those with similar attributes are grouped to form a cluster. After the entire database has been traversed, the number of partitions K is always less than the number of objects N. The K-means algorithm thus divides the database into partitions, or clusters, which are disjoint subsets. If the data set contains N objects, it is divided into k disjoint subsets, each with its own id. Here k is always a positive integer.

Data Preprocessing
The attributes present in the data provided to us were Name, Age, Gender, Rank, Category, Sector, Address, Register number and Phone number.
The attributes found to contribute to the result after applying the chi-square test are as follows.

Rank: rank obtained by the student in the PG entrance examination. Range: 1 to 5000.
Category: social background. Range: 2A, 2B, 3A, 3B, GM, SC, ST, OBC.
Gender: Range: Male, Female.
Sector: Range: Urban, Rural.
Specialization: Range: A to N.

All input values are mapped to numeric values between 0 and 1, as listed in Table I (Mapping input values to numeric values).
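The mapping of raw inputs to values between 0 and 1 can be illustrated with a minimal sketch. The paper does not state the exact formulas, so the min-max scaling of the rank and the evenly spaced category codes below are assumptions, and the names `encode_student`, `CATEGORIES` and `MAX_RANK` are hypothetical. Only the 0/1 coding of Gender and Sector and the rank range follow the text.

```python
from typing import Dict

# Assumed encodings: the paper does not give the exact mapping formulas.
CATEGORIES = ["2A", "2B", "3A", "3B", "GM", "SC", "ST", "OBC"]
MAX_RANK = 5000  # upper end of the rank range stated in the text


def encode_student(rank: int, gender: str, sector: str, category: str) -> Dict[str, float]:
    """Map one student's raw inputs to numeric values in [0, 1]."""
    if not 1 <= rank <= MAX_RANK:
        raise ValueError("rank must lie between 1 and MAX_RANK")
    return {
        # Min-max scaling of the rank (assumption)
        "rank": (rank - 1) / (MAX_RANK - 1),
        # Female -> 0, Male -> 1 and Rural -> 0, Urban -> 1
        "gender": {"female": 0.0, "male": 1.0}[gender.lower()],
        "sector": {"rural": 0.0, "urban": 1.0}[sector.lower()],
        # Evenly spaced category codes (assumption)
        "category": CATEGORIES.index(category.upper()) / (len(CATEGORIES) - 1),
    }


print(encode_student(200, "Male", "Rural", "2A"))
```

Any monotone mapping into the unit interval would serve the same purpose; the point is only that every attribute ends up on a comparable scale before clustering.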


Application of the K-means algorithm to the data set:
Step 1: Initialize the value of k, either manually or systematically.
Step 2: The database provided is divided into a number of groups based on the following attributes: Rank, Category, Gender, Specialization and Sector.
TABLE II. INPUT FOR K-MEANS ALGORITHM
Rank | Sector | Gender | Category
0.25 | 0 | 1 | 0.25
0.35 | 1 | 0 | 0.50
0.30 | 1 | 1 | 0.75
0.90 | 0 | 0 | 1
Table II is obtained from Table I by applying numerical formulae. In Table II, the value 0.25 in the Rank column is compared with all the other values in the same column, and the differences between the values are noted (e.g., 0.25 and 0.35). Similarly, the second value, 0.35, is compared with the rest of the values in the column (e.g., 0.35 and 0.30). The same process continues for the other values, and a difference is obtained in each case. This yields a column of differences; values that are close to each other are grouped into a cluster, which forms a centroid.
$|0.25 - 0.35| = 0.10$, $|0.35 - 0.30| = 0.05$
0.10 and 0.05 form a centroid, provided there are no differences smaller than these.
Step 3:
while (!EOF) {
    if (next value in the difference column is nearest to the centroid) {
        include it in the cluster
    } else {
        form a new cluster
    }
}
After each step the cluster sets are updated, which results in the formation of the classified knowledge data set. The data entered by the student is compared with this classified knowledge data set to predict the specialization to be selected.
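The loop in Step 3 can be sketched in Python as a simple incremental grouping of one column. This is an illustration under stated assumptions rather than the author's implementation: the distance `threshold` that decides when a value is "near" a centroid is not given in the paper, and the function name `incremental_cluster` is hypothetical. The sample values are the Rank column of Table II.

```python
from typing import List, Optional


def incremental_cluster(values: List[float], threshold: float) -> List[List[float]]:
    """Group 1-D values one at a time.

    A value joins the cluster whose centroid (mean) is nearest, provided the
    distance is within `threshold`; otherwise it starts a new cluster.
    The threshold is an assumption, not a parameter from the paper.
    """
    clusters: List[List[float]] = []
    for v in values:
        best: Optional[List[float]] = None
        best_dist = threshold
        for cluster in clusters:
            centroid = sum(cluster) / len(cluster)
            dist = abs(v - centroid)
            if dist <= best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append([v])   # form a new cluster
        else:
            best.append(v)         # include in the nearest cluster
    return clusters


ranks = [0.25, 0.35, 0.30, 0.90]  # Rank column of Table II
print(incremental_cluster(ranks, threshold=0.15))
```

With a threshold of 0.15 the three close rank values end up together and 0.90 forms its own cluster, which matches the intuition in the worked differences above.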
TABLE I. MAPPING INPUT VALUES TO NUMERIC VALUES
Attribute | Input values | Numeric values
Gender | Female, Male | 0 and 1
Category | 2A, 2B, 3A, 3B, OBC, GM, SC, ST | 0 and 1
Rank | 1 to N | 0 and 1
Sector | Rural, Urban | 0 and 1
Branch | A to N | 0 and 1
Chances | E, G, A, P | 0 and 1

The distortion (D) for the K-means algorithm is calculated using
$$D = \sum_{i=1}^{N} \left( \mathrm{dataset}(i) - \mathrm{center}(\mathrm{dataset}(i)) \right)^{2}$$
If the value of D is close to 0, the performance of the algorithm is good; otherwise, its performance is below average.
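The quantity D can be computed directly from a set of clusters. The sketch below assumes one-dimensional values and takes each cluster's centre to be its mean; the function name `distortion` is hypothetical.

```python
from typing import Sequence


def distortion(clusters: Sequence[Sequence[float]]) -> float:
    """Sum of squared distances from every value to the mean of its cluster."""
    total = 0.0
    for cluster in clusters:
        centre = sum(cluster) / len(cluster)
        total += sum((x - centre) ** 2 for x in cluster)
    return total


# Rank values of Table II grouped into two clusters (grouping chosen for illustration)
print(distortion([[0.25, 0.35, 0.30], [0.90]]))
```

A value near 0 means the points sit close to their cluster centres, which is the criterion the text uses to call the clustering good.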

NAIVE BAYES:
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem.
The procedure followed when applying this method is:
Step 1: Data preprocessing
Step 2: Finding positive and negative knowledge data
Step 3: Application of Bayes' theorem
TABLE III. INPUT FOR NAÏVE BAYES
Name | Age | Gender | Sector | Category | Rank | Branch
Deepak | 21 | M | Rural | 2a | 200 | Finance
Imran | 22 | M | Urban | 3b | 100 | Operations
Kavya | 22 | F | Rural | SC | 400 | HR
Ravi | 22 | M | Rural | SC | 867 | Marketing
Step 1: Data preprocessing:
Missing values are filled in, and a dependency check on the attributes listed in Table III is performed using the chi-square test. Table IV is the result after preprocessing.
Step 2: Finding positive and negative knowledge data:
Selection constructs are applied to the rank attribute to obtain the positive and negative knowledge data.
if (rank <= 800) {  // the maximum limit of the possible rank
    positive knowledge data
} else {
    negative knowledge data
}
The above process is repeated for all the attributes listed in Table IV to obtain the positive knowledge data (Table V).
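The rank-based selection in Step 2 amounts to a simple partition of the records. The following sketch uses the four records of the Naïve Bayes input table with the name and age columns dropped; the dictionary layout and the function name `split_by_rank` are illustrative choices, not part of the paper.

```python
from typing import Any, Dict, List, Tuple

RANK_LIMIT = 800  # threshold used in the text

Record = Dict[str, Any]


def split_by_rank(records: List[Record], limit: int = RANK_LIMIT) -> Tuple[List[Record], List[Record]]:
    """Return (positive, negative) knowledge data according to the rank limit."""
    positive = [r for r in records if r["rank"] <= limit]
    negative = [r for r in records if r["rank"] > limit]
    return positive, negative


records = [
    {"gender": "M", "sector": "Rural", "category": "2a", "rank": 200},
    {"gender": "M", "sector": "Urban", "category": "3b", "rank": 100},
    {"gender": "F", "sector": "Rural", "category": "SC", "rank": 400},
    {"gender": "M", "sector": "Rural", "category": "SC", "rank": 867},
]
positive, negative = split_by_rank(records)
print(len(positive), len(negative))
```

Only the record with rank 867 exceeds the limit, which is consistent with that row carrying a Null branch after preprocessing.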
TABLE IV. AFTER PREPROCESSING
Gender | Sector | Category | Rank | Branch
M | Rural | 2a | 200 | Finance
M | Urban | 3b | 100 | Operations
F | Rural | SC | 400 | HR
M | Rural | SC | 867 | Null
TABLE V. POSITIVE KNOWLEDGE DATA
Step 3: Applying Bayes' theorem to Table V gives the resulting output table.
First, the data in Table V is converted to numeric data. The formulae listed below are then used to obtain the output table.
$$h_{MAP} = \arg\max_{h} P(h \mid D) = \arg\max_{h} \frac{P(D \mid h)\, P(h)}{P(D)}$$
where $P(h) = h/n$ and $P(D) = D/n$, with
h = hypothesis (possibilities)
D = data set
n = number of records in the data set
The maximum likelihood hypothesis is calculated from $P(D \mid h)$ alone:
$$h_{ML} = \arg\max_{h} P(D \mid h)$$
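To make the MAP rule concrete, here is a minimal categorical Naive Bayes sketch. It estimates P(h) and each P(attribute | h) as relative frequencies and multiplies them under the usual naive independence assumption; P(D) is the same for every hypothesis, so it is left out of the comparison. The training rows are the three records with rank at most 800 from the preprocessed table, with Branch as the class label. The function names and the choice of Branch as the hypothesis are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter
from typing import Dict, List, Tuple

Row = Tuple[Tuple[str, ...], str]  # (attribute values, class label)


def train(rows: List[Row]) -> Tuple[Dict[str, float], Counter, Counter]:
    """Estimate priors and conditional counts by relative frequency."""
    n = len(rows)
    class_counts = Counter(label for _, label in rows)
    # cond_counts[(i, value, label)]: rows of class `label` whose attribute i equals `value`
    cond_counts: Counter = Counter()
    for attrs, label in rows:
        for i, value in enumerate(attrs):
            cond_counts[(i, value, label)] += 1
    priors = {h: c / n for h, c in class_counts.items()}
    return priors, cond_counts, class_counts


def h_map(attrs: Tuple[str, ...], priors: Dict[str, float],
          cond_counts: Counter, class_counts: Counter) -> str:
    """Return the hypothesis maximising P(h) * prod_i P(attr_i | h)."""
    scores = {}
    for h, prior in priors.items():
        score = prior
        for i, value in enumerate(attrs):
            score *= cond_counts[(i, value, h)] / class_counts[h]
        scores[h] = score
    return max(scores, key=scores.get)


rows: List[Row] = [
    (("M", "Rural", "2a"), "Finance"),
    (("M", "Urban", "3b"), "Operations"),
    (("F", "Rural", "SC"), "HR"),
]
model = train(rows)
print(h_map(("F", "Rural", "SC"), *model))
```

With so few records most conditional probabilities are 0 or 1; on a realistic data set some form of smoothing would normally be added so that one unseen attribute value does not zero out a hypothesis.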
TABLE VI. OUTPUT TABLE
Rank | Gender | Sector | Category | Branch | Chance
1200 | M | Rural | Any | Marketing | E
1200 | M | Urban | Any | Operations | E
1200 | F | Rural | Any | HR | E


Support Vector Clustering (SVC):
SVC classifies or clusters the data set based on complex patterns in the data. SVC is built on machines, which are the objects of a class as defined by the algorithm. It explores fine details of the connections between data points in a data set and groups the data in the resulting attribute space, which is obtained from the knowledge database. In our case, for example, the knowledge database obtained from Naïve Bayes acts as the reference for SVC.
Duality: A function should be defined that checks for connections between the attributes in a data set; classification is done on the basis of these connections.
Name | Age | Gender | Sector | Category | Rank | Branch
Shiva | 21 | M | Rural | 2a | 200 | Marketing
John | 22 | M | Urban | 3b | 100 | Operations
Rani | 22 | F | Rural | SC | 400 | HR
Application of the algorithm on a dataset
TABLE VII. INPUT FOR SVC ALGORITHM
Name | Category | Age | Sector | Rank | Gender
William | 3B | 30 | Urban | 550 | Male
Michal | 3B | 38 | Urban | 549 | Male
Bhagwan | 2A | 21 | Rural | 1 | Male
Emmanvel | 2A | 32 | Rural | 13 | Male
Reddy | 3A | 23 | Urban | 11 | Male
Balu | 3A | 23 | Urban | 10 | Male
Preprocessing: After the dependency check, the Name and Age columns are removed from the input table.
Working of the algorithm: The value of an attribute in the input table (Table VII) is compared with the corresponding value in the knowledge database. If they match, the record is placed in one cluster.
E.g., if the rank is equal to a rank in the knowledge data set and the group exists, then the record is included in the same group.
TABLE VIII. OUTPUT OF SVC APPLIED ON RANK
Rank | Branch | Chance
1 to 13 | Operations | E
549 to 550 | Marketing | A
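The rank-matching rule described for SVC can be sketched as a lookup against the knowledge database. Note that this mirrors only the matching step as the text describes it, not the kernel-based support vector clustering algorithm from the literature. The rank ranges are those of the output table for SVC applied on rank, the ranks passed in are those of the SVC input table, and the names `KNOWLEDGE`, `match_group` and `group_by_rank` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Knowledge database as rank ranges: (low, high, branch, chance)
KNOWLEDGE: List[Tuple[int, int, str, str]] = [
    (1, 13, "Operations", "E"),
    (549, 550, "Marketing", "A"),
]


def match_group(rank: int) -> Optional[Tuple[str, str]]:
    """Return the (branch, chance) group whose rank range contains `rank`."""
    for low, high, branch, chance in KNOWLEDGE:
        if low <= rank <= high:
            return branch, chance
    return None


def group_by_rank(ranks: List[int]) -> Dict[Tuple[str, str], List[int]]:
    """Place each rank in the group it matches; unmatched ranks are skipped."""
    groups: Dict[Tuple[str, str], List[int]] = {}
    for rank in ranks:
        key = match_group(rank)
        if key is not None:
            groups.setdefault(key, []).append(rank)
    return groups


print(group_by_rank([550, 549, 1, 13, 11, 10]))
```

How unmatched ranks should be treated is not stated in the paper; skipping them here is an arbitrary choice for the sketch.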
CONCLUSION
Applying data mining techniques to educational data is concerned with developing methods for exploring the unique types of data found there; in the educational domain, each problem has specific objectives and unique characteristics that require a different approach to solve.
In this study, a combined approach using algorithms from both the classification and clustering models was used for predicting placement chances. Two clustering algorithms, K-means and SVC, and one classification algorithm, Naive Bayes, were applied. Among these, K-means, representing the cluster model, proved to be the best predictor for placement chance prediction problems. The result obtained with our K-means implementation (83%) was compared with the result obtained using RapidMiner (83.05%). With the information generated through this study, a student would be able to select the appropriate specialization with the best chances of getting placed. Furthermore, the work can be extended to prediction problems in other disciplines, using different approaches.


IMPLEMENTATION

The algorithms were implemented, and the front end of the tool was developed, using PHP with MySQL as the database. A prospective student enters basic information, such as rank in the postgraduate entrance exam and category, in the user interface, and the application predicts the course that offers better placement chances, which the student can opt for when selecting the postgraduate course.
TESTING
The data mining algorithms K-means, Naïve Bayes and Support Vector Clustering were applied to the same data set, and the tests were conducted separately. The results of the tests for each algorithm were modeled as a confusion matrix. The confusion matrix expresses the performance of the three algorithms in terms of true positive rate, accuracy and precision.
TABLE IX. CONFUSION MATRIX TABLE
Algorithm | TPR | Accuracy | Precision
K-means | 0.83 | 83% | 0.83
Naïve Bayes | 0.80 | 77% | 0.75
SVC | 0.81 | 81% | 0.81
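From Table IX it is clear that the K-means algorithm is the most accurate, with 83%, compared to the other algorithms, SVC (81%) and Naïve Bayes (77%). K-means also leads with respect to true positive rate (0.83) and precision (0.83). Thus the clustering algorithm K-means predicts the results better than the other algorithms used.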
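For reference, the three measures reported above follow from the counts of a binary confusion matrix in the standard way. The sketch below uses made-up counts chosen only to show the arithmetic; they are not the counts behind the paper's results, which are not given. The function name `confusion_metrics` is hypothetical.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute true positive rate, accuracy and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "tpr": tp / (tp + fn),          # correctly predicted positives / all actual positives
        "accuracy": (tp + tn) / total,  # all correct predictions / all instances
        "precision": tp / (tp + fp),    # correctly predicted positives / all predicted positives
    }


# Hypothetical counts for illustration only
print(confusion_metrics(tp=50, fp=10, fn=10, tn=30))
```

For a four-class outcome such as E, G, A and P, these measures would be computed per class and then averaged; the paper does not say which averaging was used.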
BIBLIOGRAPHY

[1] A new hybrid method based on fuzzy-artificial immune system and k-nn algorithm for breast cancer diagnosis, Volume 37, Issue 3, March 2007, Pages 415-423, Seral Şahan, Kemal Polat.

[2] Prediction of protein subcellular locations using fuzzy k-NN method, Volume 20, Issue 1, pages 21-28, Ying Huang and Yanda Li.

[3] A fuzzy K-nearest neighbor algorithm, Volume SMC-15, Issue 4, pages 580-585, Keller, J.M., Givens, J.A.

[4] Improving the performance of boosting for Naïve Bayesian classification, volume 1574, 1999, pages 296-305, Kai Ming Ting, Zijian Zheng.

[5] Application of fuzzy Naïve Bayes and a real-valued genetic algorithm in identification of fuzzy model, volume 169, issue 3-4, 2005, pages 205-226, Yongchuan Tang, Yang Xu.

[6] Effective Data Mining Using Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996, Hongjun Lu, Rudy Setiono, and Huan Liu.

[7] Using Neural Networks for Data Mining, Future Generation Computer Systems special issue on Data Mining, Mark W. Craven and Jude W. Shavlik.

[8] A Comparison of Linear Genetic Programming and Neural Networks in Medical Data Mining, IEEE Transactions on Evolutionary Computation, Vol. 5, No. 1, February 2001, Markus Brameier and Wolfgang Banzhaf.

[9] A Generalized Data Mining Framework for Placement Chance Prediction Problems, International Journal of Computer Application (0975-8887), Volume 31, No. 3, October 2011, Sudheep Elayidom, Suman Mary Idikkula and Joseph Alexander.

[10] Classification Model of Prediction for Placement of Students, I.J. Modern Education and Computer Science, 2013, 11, 49-56, Ajay Kumar Pal, Saurabh Pal.

[11] Analysis and Mining of Educational Data for Predicting the Performance of Students, (IJECCE) International Journal of Electronics Communication and Computer Engineering, Vol. 4, Issue 5, pp. 1560-1565, ISSN: 2278-4209, 2013, A. K. Pal and S. Pal.