 Open Access
 Authors : Apoorva A
 Paper ID : IJERTCONV4IS27007
 Volume & Issue : NCRIT – 2016 (Volume 4 – Issue 27)
 Published (First Online): 24-04-2018
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Clustering Proficient Students using K-Means Algorithm
Apoorva A
Dept. of MCA
Global Institute Of Management Sciences Bangalore, India
Abstract: Educational Data Mining is a field in which a combination of techniques such as data mining, machine learning and statistics is applied to educational data to extract valuable information. The objective of this paper is to cluster the proficient students of an educational institution so that their placement chance can be predicted. Clustering is accomplished using the k-means algorithm based on KSA (knowledge, communication skills and attitude). To assess the performance of the algorithm, a student data set from an institution in Bangalore was collected for the study as synthetic data. A model is proposed to arrive at the result. The accuracy of the results obtained from the algorithm was found to be promising.
Keywords: Educational data mining, proficient student, k-means algorithm

INTRODUCTION
Educational Data Mining (EDM) is the application of Data Mining (DM) techniques to educational data; its objective is to examine such data in order to resolve educational research issues.
An institution consists of many students. For a student to get placed, he/she should have a good score in KSA, that is, Knowledge, Communication skills and Attitude. This is one of the most important criteria for selecting a student during placement. It is also a known fact that better placements result in good admissions. Not all students will have a high KSA score. Hence it is necessary to identify the students who possess good KSA and those who do not. Thus there is a need for clustering to set aside the students who are not yet competent to be placed.

PROBLEM STATEMENT
An institution normally has hundreds of students. Predicting the placement chance for all of them is tedious and time consuming, and it is unnecessary for students who are not academically competent. Hence there is a need to cluster the proficient students, those having a good KSA score, whose placement chance can then be predicted.

RELATED WORKS
A performance appraisal system is basically a formal interaction between an employee and the supervisor or management, conducted periodically to identify the areas of strength and weakness of the employee. The objective is to be consistent about the strengths and to work on the weak areas to improve the performance of the individual and thus achieve optimum process quality [8] (Chein and Chen, 2006 [9]; Pal and Pal, 2013 [10]; Khan, 2005 [11]; Baradwaj and Pal, 2011 [12]; Bray, 2007 [13]; S. K. Yadav et al., 2011 [14]). K-means is one of the best-known clustering algorithms and has been applied to various problems. The k-means approach is a form of multivariate statistical analysis that partitions samples into K initial clusters. It is especially suitable when the number of observations is large or the data file is enormous (Wu, 2000 [1]). The k-means method is widely used in segmenting markets (Kim et al., 2006 [2]; Shin & Sohn, 2004 [3]; Jang et al., 2002 [4]; Hruschka & Natter, 1999 [5]; Leon Bottou et al., 1995 [6]; Vance Faber, 1994 [7]).

METHODOLOGY
Concept and research framework
The methodology, along with its computational processes for determining the proficient students, is outlined below:
Step 1: Data collection
Knowledge is represented by the marks a student scored in selected subjects over a period of three years, i.e., from June 2011 to April 2014. The data were collected from an institution in Bangalore.
Step 2: Data preprocessing
Preprocessing was done using the chi-square test for goodness of fit to remove the attributes that do not contribute to the result.
Step 3: K-means clustering technique
This step clusters the proficient students among all the students of the institution using the k-means clustering algorithm.
Step 4: Evaluate the result
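Step 3 can be illustrated with a minimal sketch of the standard k-means (Lloyd) iteration. This is not the implementation used in the study: the function name `kmeans`, the random initialisation, the choice of total marks and Com as the two features, and the scaling to [0, 1] are all assumptions made only for illustration. The four sample points are the T and Com values of the four students listed in Table II.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# T (out of 2000) and Com (out of 10) of the four students in Table II,
# scaled to [0, 1]; the scaling is an assumption, not stated in the paper.
students = [
    (624 / 2000, 7 / 10),
    (1910 / 2000, 9 / 10),
    (1416 / 2000, 7 / 10),
    (1482 / 2000, 8 / 10),
]
labels, centroids = kmeans(students, k=2)
print(labels, centroids)
```

The result of k-means can depend on the initial centroids, so in practice the procedure is usually repeated with several seeds and the most compact solution is kept.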

DATA DESCRIPTION
Table I: Database description
| Variables | Description | Possible Values |
|---|---|---|
| Stu_id | Id of the student | {Int} |
| Name | Name of the student | {Text} |
| sub | Subject name | {Text} |
| M1, M2, M3, M4 | Marks scored in each subject | {1, 2, 3, 4, 5…100} |
| T | Total marks | {1% – 100%} |
| Com | (Communication skills + Attitude) score out of 10 | {1, 2, 3, 4, 5…10} |
| Min | Minimum marks for passing a subject | 32 |
| Max | Maximum marks for passing a subject | 100 |
Stu_id: ID of the student. It can take any integer value.
Name: Name of the student.
sub: Name of the subject. It can take only text values, using the letters A to Z.
M1, M2, M3, M4: Marks scored by a student in the various subjects. They can take only numeric values from 0 to 100.
T: Total marks scored by each student, represented as a percentage, i.e., 1% to 100%.
Com: Communication and attitude score out of 10.
Min: Minimum marks for passing a subject.
Max: Maximum marks for passing a subject.
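The variables above can be pictured as a small record type. The sketch below is only one possible layout, not the schema used in the study: the class name `StudentRecord`, its method names and the validation rules are hypothetical, while the limits 32 and 100 and the 1 to 10 range for Com come from Table I.

```python
from dataclasses import dataclass
from typing import Dict

MIN_PASS = 32    # "Min" in Table I
MAX_MARKS = 100  # "Max" in Table I


@dataclass
class StudentRecord:
    """Hypothetical container for one student's row of the database."""
    stu_id: int
    name: str
    marks: Dict[str, int]  # subject name -> marks scored
    com: int               # communication + attitude score out of 10

    def __post_init__(self):
        # Reject values outside the ranges given in the data description.
        for subject, value in self.marks.items():
            if not 0 <= value <= MAX_MARKS:
                raise ValueError(f"marks for {subject} out of range: {value}")
        if not 1 <= self.com <= 10:
            raise ValueError(f"Com out of range: {self.com}")

    @property
    def total(self) -> int:
        """Sum of marks over all recorded subjects."""
        return sum(self.marks.values())

    @property
    def percentage(self) -> float:
        """Total marks expressed as a percentage (the variable T)."""
        return 100.0 * self.total / (MAX_MARKS * len(self.marks))

    def failed_subjects(self):
        """Subjects in which the marks are below the pass mark."""
        return [s for s, m in self.marks.items() if m < MIN_PASS]


# Six of the subject marks listed for student 1 in the input table.
vikas = StudentRecord(
    stu_id=1,
    name="vikas",
    marks={"Ca": 20, "Bi": 23, "Java": 24, "Se": 25, "Cf": 26, "Db": 28},
    com=7,
)
print(vikas.total, round(vikas.percentage, 2), vikas.failed_subjects())
```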

EXPERIMENTAL EVALUATION
Step 1: Data collection
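Table II: Input table
| stu_id | | | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|---|---|
| Name | | | vikas | guru | sayed | deepak | |
| Sub | Min | Max | M1 | M2 | M3 | M4 | M5 |
| Ca | 32 | 100 | 20 | 98 | 45 | 92 | |
| Bi | 32 | 100 | 23 | 98 | 69 | 83 | |
| Java | 32 | 100 | 24 | 97 | 67 | 74 | |
| Se | 32 | 100 | 25 | 96 | 89 | 92 | |
| Cf | 32 | 100 | 26 | 95 | 88 | 88 | |
| Db | 32 | 100 | 28 | 90 | 56 | 81 | |
| .. | | | | | | | |
| T | | | 624 | 1910 | 1416 | 1482 | |
| Com | | | 7 | 9 | 7 | 8 | |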
This is an extract of the student database with the fields or variables listed above. The marks a student scored in selected subjects over a period of three years, i.e., from June 2011 to April 2014, were collected from an institution in Bangalore.
Step 2: Data preprocessing
Preprocessing is done using the following statistical technique.
Chi-square test: applied to remove the variables that do not contribute to the result. Name, Max and Min were removed from the above table to obtain Table III.
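For reference, the general chi-square goodness-of-fit statistic is given below, where O_i and E_i denote the observed and expected frequencies in category i and n is the number of categories. The paper does not state how the student attributes were mapped onto categories, so this is only the textbook form of the statistic and not a reconstruction of the exact computation used.

```latex
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
```

A large value of the statistic, relative to the chi-square distribution with n - 1 degrees of freedom, indicates that the observed frequencies deviate from the expected ones.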
Table III : Preprocessed table
The steps of the k-means procedure are explained below:
Step 1: Clustering using the k-means algorithm.
Step 1.1: The preprocessed table is the input for k-means.
Step 1.2: Cluster the proficient student segment [PCS] and determine the exact number of clusters. The value of K is incremented in each step and the results are shown below.
The partition of PCS is done initially by taking k=2.
After applying k-means clustering with k=2, we have:
Table IV: Partial view of clusters of students, for k=2
| Cluster 1 | Cluster 2 |
|---|---|
| 1 | 2 |
| 10 | 3 |
| 11 | 4 |
| 12 | 5 |
| 13 | 6 |
| 16 | 7 |
| 18 | 8 |
| 21 | 9 |
| 22 | 14 |
| 24 | 15 |
| 26 | 17 |
| 27 | 19 |
| 29 | 20 |
| 30 | 23 |
| | 25 |
| | 28 |
The above table shows the grouping of students into two groups.
Table V: Difference between clusters for k=2
| Cluster | Cluster 1 | Cluster 2 |
|---|---|---|
| Cluster 1 | 0 | 0.229 |
| Cluster 2 | 0.229 | 0 |
For k = 2, the distances between the groups are given in Table V; 0.229 is the minimum value.
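The paper does not state how the distance between two clusters is computed. One common choice, assumed here purely for illustration, is the Euclidean distance between cluster centroids. The function names and the example centroids below are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def centroid_distances(centroids):
    """Return {(i, j): distance} for every pair of centroids with i < j."""
    return {
        (i, j): math.dist(ci, cj)
        for (i, ci), (j, cj) in combinations(enumerate(centroids), 2)
    }


def minimum_distance(centroids):
    """Smallest distance between any two distinct centroids."""
    return min(centroid_distances(centroids).values())


# Hypothetical centroids, chosen only to make the arithmetic easy to follow.
example = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
pairwise = centroid_distances(example)
print(pairwise)
print(minimum_distance(example))
```

Filling a symmetric table such as Table V then amounts to writing each pairwise value at positions (i, j) and (j, i), with zeros on the diagonal.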
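For k=3, applying k-means clustering, we have the following results.
Input data (partial view, without Name, Min and Max):
| stu_id | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Sub | M1 | M2 | M3 | M4 | M5 |
| ca | 20 | 98 | 45 | 92 | |
| bi | 23 | 98 | 69 | 83 | |
| java | 24 | 97 | 67 | 74 | |
| se | 25 | 96 | 89 | 92 | |
| cf | 26 | 95 | 88 | 88 | |
| db | 28 | 90 | 56 | 81 | |
| .. | | | | | |
| T | 627 | 1910 | 1416 | 1482 | |
| Com | 7 | 9 | 7 | 8 | |
Table VI: Partial view of three clusters, for k=3
| Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|
| 1 | 9 | 2 |
| 10 | 24 | 3 |
| 11 | | 4 |
| 12 | | 5 |
| 13 | | 6 |
| 16 | | 7 |
| 18 | | 8 |
| 21 | | 14 |
| 22 | | 15 |
| 26 | | 17 |
| 27 | | 19 |
| 29 | | 20 |
| 30 | | 23 |
| | | 25 |
| | | 28 |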
The above table indicates the partial view of 3 clusters.
Table VII: Differences between clusters
| Cluster | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Cluster 1 | 0 | 0.116 | 0.165 |
| Cluster 2 | 0.116 | 0 | 0.154 |
| Cluster 3 | 0.165 | 0.154 | 0 |
For k = 3, the distances between the groups are given in Table VII; 0.116 (approximately 0.12) is the minimum value.
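For k=4, we have the following results.
Table VIII: Partial view of four clusters, for k=4
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| 1 | 9 | 3 | 2 |
| 10 | | 4 | |
| 11 | | 5 | |
| 12 | | 6 | |
| 13 | | 7 | |
| 16 | | 8 | |
| 18 | | 14 | |
| 21 | | 15 | |
| 22 | | 17 | |
| 24 | | 19 | |
| 26 | | 20 | |
| 27 | | 23 | |
| 29 | | 25 | |
| 30 | | 28 | |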
Table IX: Comparison of distance between the clusters
| Cluster | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| Cluster 1 | 0 | 0.104 | 0.187 | 0.342 |
| Cluster 2 | 0.104 | 0 | 0.083 | 0.238 |
| Cluster 3 | 0.187 | 0.083 | 0 | 0.154 |
| Cluster 4 | 0.342 | 0.238 | 0.154 | 0 |
The comparison table above gives the distance between each pair of clusters. For example, the distance between Cluster 2 and Cluster 1 is 0.104537, given in row 1, column 3. The other values are calculated similarly. This table results from applying k-means while incrementing the value of k by 1 in every step.
Table X: Cluster distance table
| Number of clusters | The short cluster distance |
|---|---|
| Cluster 2 | 0.2293 |
| Cluster 3 | 0.1658 |
| Cluster 4 | 0.3428 |
| Cluster 5 | 0.3133 |
The first value, 0.2293, in the short cluster distance field represents the distance between clusters 1 and 2; similarly, the second value, 0.1658, represents the distance between clusters 1 and 3. The other values in the table can be interpreted similarly.
From the above table it can be observed that the values of the short cluster distance attribute start increasing greatly, from 0.1658 to 0.3428, beyond three clusters. Hence it can be concluded that the maximum number of clusters that can be formed is 3. So we choose k=3 and the 3rd cluster, because the centroid of the third cluster is nearest to the maximum marks of the subjects, i.e., 2000 (20 subjects).
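The selection rule described above, stopping at the last k before the cluster distance rises sharply, can be sketched in a few lines. The four distances are the values reported in Table X; the function name `k_before_largest_jump` and the reading of the rule as "take the k just before the largest increase" are assumptions made for illustration.

```python
# Number of clusters -> reported cluster distance (values from Table X).
distance_by_k = {2: 0.2293, 3: 0.1658, 4: 0.3428, 5: 0.3133}


def k_before_largest_jump(distances):
    """Return the k that immediately precedes the largest increase in distance."""
    ks = sorted(distances)
    # Increase observed when moving from each k to the next larger k.
    jumps = {k: distances[k_next] - distances[k] for k, k_next in zip(ks, ks[1:])}
    return max(jumps, key=jumps.get)


chosen_k = k_before_largest_jump(distance_by_k)
print(chosen_k)
```

With the Table X values, the only sizeable increase is from 0.1658 (three clusters) to 0.3428 (four clusters), so the rule selects k=3.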
Step 2: Choosing the cluster
When k=3, the 3rd cluster is chosen as the best cluster, as the centroid value of the third cluster is nearest to the maximum marks of the subjects, i.e., 2000 (20 subjects).
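Choosing the cluster whose centroid lies nearest to the maximum total of 2000 can be sketched as follows. The paper does not report the centroid values, so the totals below are hypothetical placeholders, and the function name `nearest_to_maximum` is likewise an assumption.

```python
MAXIMUM_TOTAL = 2000  # 20 subjects, 100 marks each


def nearest_to_maximum(centroid_totals, maximum=MAXIMUM_TOTAL):
    """Return the cluster id whose centroid total is closest to the maximum."""
    return min(centroid_totals, key=lambda c: abs(maximum - centroid_totals[c]))


# Hypothetical centroid totals for k=3; NOT values reported in the paper.
centroid_totals = {1: 700.0, 2: 1100.0, 3: 1500.0}
best_cluster = nearest_to_maximum(centroid_totals)
print(best_cluster)
```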
Step 3: Identifying the elements of the cluster.
Table XI: Elements of Cluster 3
| Cluster 3 |
|---|
| 2, 3, 4, 5, 6, 7, 8, 14, 15, 17, 19, 20, 23, 25, 28 |
The table above represents the elements of the best cluster identified.


RESULTS
From the above deduction, cluster 3 is found to be the best cluster; the proficient students it contains are given below:
Table XII: Elements of Cluster 3
| Cluster 3 |
|---|
| 2, 3, 4, 5, 6, 7, 8, 14, 15, 17, 19, 20, 23, 25, 28 |

CONCLUSION
The main objective was to identify proficient students by clustering using the k-means algorithm. Cluster 3, containing the proficient students, emerged as the best cluster among the students of the institution. The algorithm has an accuracy of 89%. Thus a solution to the stated problem was found.
REFERENCES

Kim, S. Y., Jung, T. S., Suh, E. H., & Hwang, H. S. (2006). Customer segmentation and strategy development based on customer lifetime value: A case study. Expert Systems with Applications, 31(1), 101-107.

Shin, H. W., & Sohn, S. Y. (2004). Product differentiation and market segmentation as alternative marketing strategies. Expert Systems with Applications, 27(1), 27-33.

Jang, S. C., Morrison, A. M. T., & O'Leary, J. T. (2002). Benefit segmentation of Japanese pleasure travelers to the USA and Canada: Selecting target markets based on the profitability and the risk of individual market segment. Tourism Management, 23(4), 367-378.

Hruschka, H., & Natter, M. (1999). Comparing performance of feed forward neural nets and k-means for cluster-based market segmentation. European Journal of Operational Research, 114(3), 346-353.

Leon Bottou, Yoshua Bengio, Convergence Properties of the K-Means Algorithms, Advances in Neural Information Processing Systems 7, 1995.

Vance Faber, Clustering and the Continuous k-Means Algorithm, Los Alamos Science, 1994.

Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining Association Rules between Sets of Items in Large Databases, ACM SIGMOD Record, 1993.

Archer North and Associates, Performance Appraisal, http://www.performanceappraisal.com, 2006, Accessed Dec, 2012.

Chein, C., Chen, L., "Data mining to improve personnel selection and enhance human capital: A case study in high technology industry", Expert Systems with Applications, In Press (2006).

K. Pal and S. Pal, Analysis and Mining of Educational Data for Predicting the Performance of Students, (IJECCE) International Journal of Electronics Communication and Computer Engineering, Vol. 4, Issue 5, pp. 1560-1565, ISSN: 2278-4209, 2013.

Z. N. Khan, Scholastic achievement of higher secondary students in science stream, Journal of Social Sciences, Vol. 1, No. 2, pp. 84-87, 2005.

B.K. Bharadwaj and S. Pal, Mining Educational Data to Analyze Students' Performance, International Journal of Advance Computer Science and Applications (IJACSA), Vol. 2, No. 6, pp. 63-69, 2011.

M. Bray, The shadow education system: private tutoring and its implications for planners, (2nd ed.), UNESCO, Paris, France, 2007.

S. K. Yadav, B.K. Bharadwaj and S. Pal, Data Mining Applications: A comparative study for Predicting Students' Performance, International Journal of Innovative Technology and Creative Engineering (IJITCE), Vol. 1, No. 12, pp. 13-19, 2011.