- **Open Access**
- **Total Downloads:** 12
- **Authors:** Apoorva A
- **Paper ID:** IJERTCONV4IS27007
- **Volume & Issue:** NCRIT – 2016 (Volume 4 – Issue 27)
- **Published (First Online):** 24-04-2018
- **ISSN (Online):** 2278-0181
- **Publisher Name:** IJERT
- **License:** This work is licensed under a Creative Commons Attribution 4.0 International License

#### Clustering Proficient Students using K-Means Algorithm

Apoorva A

Dept. of MCA

Global Institute Of Management Sciences Bangalore, India

Abstract: Educational Data Mining is a field in which a combination of techniques such as data mining, machine learning and statistics is applied to educational data to obtain valuable information. The objective of this paper is to cluster the proficient students of an educational institution so that their placement chance can be predicted. Clustering is performed using the k-means algorithm based on KSA (knowledge, communication skills and attitude). To assess the performance of the algorithm, a student data set was collected from an institution in Bangalore and used as synthetic data for the study. A model is proposed to arrive at the result. The accuracy of the results obtained from the algorithm was found to be promising.

Keywords: Educational data mining, proficient student, k-means algorithm

INTRODUCTION

Educational Data Mining (EDM) is the application of Data Mining (DM) techniques to educational data; its objective is to examine such data in order to resolve educational research issues.

An institution has many students. To get placed, a student should have a good KSA score, where KSA stands for knowledge, communication skills and attitude. This is one of the most important criteria when selecting a student for placement. It is also well known that better placements result in good admissions. Not all students have a high KSA score. Hence it is necessary to identify the students who possess a good KSA score and those who do not. Clustering is therefore needed to filter out students who are not competent to be placed.

PROBLEM STATEMENT

An institution normally has hundreds of students. Predicting the placement chance of every student is tedious and time consuming, and it is unnecessary for students who are academically incompetent. Hence there is a need to cluster the proficient students, those with a good KSA score, whose placement chance can then be predicted.

RELATED WORKS

A performance appraisal system is basically a formal interaction between an employee and the supervisor or management, conducted periodically to identify the employee's areas of strength and weakness. The objective is to be consistent about the strengths and to work on the weak areas in order to improve the performance of the individual and thus achieve optimum process quality [8] (Chein and Chen, 2006 [9]; Pal and Pal, 2013 [10]; Khan, 2005 [11]; Baradwaj and Pal, 2011 [12]; Bray, 2007 [13]; S. K. Yadav et al., 2011 [14]). K-means is a widely used clustering algorithm that has been applied to various problems. The K-means approach is a kind of multivariate statistical analysis that partitions samples into K primitive clusters. It is especially suitable when the number of observations is large or the data file is enormous (Wu, 2000 [1]). The K-means method is widely used in segmenting markets (Kim et al., 2006 [2]; Shin & Sohn, 2004 [3]; Jang et al., 2002 [4]; Hruschka & Natter, 1999 [5]; Leon Bottou et al., 1995 [6]; Vance Faber et al., 1994 [7]).

METHODOLOGY

Concept and research framework

The methodology, along with its computational processes for determining the proficient students, is outlined below:

Step 1: Data collection.

Knowledge is represented by the marks a student scored in selected subjects over a period of three years, i.e., from June 2011 to April 2014. The data was collected from an institution in Bangalore.

Step 2: Data preprocessing

Preprocessing was done using the chi-square test for goodness of fit to remove the attributes that do not contribute to the result.

Step 3: K-means clustering technique

This step clusters the proficient students among all the students of the institution using the K-means clustering algorithm.

Step 4: Evaluate the result
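
The clustering in Step 3 can be illustrated with a minimal sketch of the standard k-means (Lloyd) iteration. This is not the author's implementation: the function name `kmeans`, the choice of the first k points as initial centroids, and the scaling of each feature to the range 0 to 1 (total marks divided by 2000, Com divided by 10) are assumptions made only for illustration. The four rows are students 1 to 4 of Table II.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Cluster equal-length tuples into k groups; returns (centroids, labels)."""
    # Simplifying assumption: the first k points seed the centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return centroids, labels


# Students 1-4 from Table II as (total / 2000, Com / 10); the scaling is assumed.
students = [(624 / 2000, 7 / 10), (1910 / 2000, 9 / 10),
            (1416 / 2000, 7 / 10), (1482 / 2000, 8 / 10)]
centroids, labels = kmeans(students, k=2)
print(labels)
```

With these four rows and k=2, student 1 (the lowest total) ends up alone in one cluster and students 2 to 4 share the other. A different initialisation can converge to a different partition, which is a known property of k-means.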

DATA DESCRIPTION

Table I: Database description

| Variables | Description | Possible Values |
|---|---|---|
| Stu_id | Id of the student | {Int} |
| Name | Name of the student | {Text} |
| sub | Subject name | {Text} |
| M1, M2, M3, M4 | Marks scored in each subject | {1, 2, 3, 4, 5…100} |
| T | Total marks | {1% – 100%} |
| Com | (Communication skills + Attitude) score out of 10 | {1, 2, 3, 4, 5…10} |
| Min | Minimum marks for passing a subject | 32 |
| Max | Maximum marks for a subject | 100 |

Stu_id: ID of the student. It can take any integer value.

Name: Name of the student.

sub: Name of the subject. It takes only text values (letters A-Z).

M1, M2, M3, M4: Marks scored by a student in the various subjects. They take only numeric values from 0 to 100.

T: Total marks scored by each student, expressed as a percentage, i.e., 1% to 100%.

Com: Communication and attitude score out of 10.

Min: Minimum marks for passing a subject.

Max: Maximum marks for a subject.
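
The variables above can be sketched as a small record type. The class name `StudentRecord` and its methods are hypothetical and not part of the paper; the marks are the six subjects that the input table lists for students 1 and 2 (the remaining subjects are elided in the source, so the totals here cover six subjects only).

```python
from dataclasses import dataclass

PASS_MARK = 32   # "Min" in Table I
MAX_MARK = 100   # "Max" in Table I


@dataclass
class StudentRecord:
    stu_id: int
    marks: dict  # subject name -> mark scored (0..100)
    com: int     # communication + attitude score, out of 10

    def total(self) -> int:
        """Sum of the marks over all recorded subjects."""
        return sum(self.marks.values())

    def total_percent(self) -> float:
        """T expressed as a percentage of the maximum obtainable marks."""
        return 100.0 * self.total() / (MAX_MARK * len(self.marks))

    def failed_subjects(self) -> list:
        """Subjects in which the mark is below the pass mark."""
        return [s for s, m in self.marks.items() if m < PASS_MARK]


student_1 = StudentRecord(1, {"Ca": 20, "Bi": 23, "Java": 24,
                              "Se": 25, "Cf": 26, "Db": 28}, com=7)
student_2 = StudentRecord(2, {"Ca": 98, "Bi": 98, "Java": 97,
                              "Se": 96, "Cf": 95, "Db": 90}, com=9)
print(student_1.total(), student_2.total())
```

Over these six subjects student 1 totals 146 marks and is below the pass mark in every subject, while student 2 totals 574 marks with no failed subject.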

EXPERIMENTAL EVALUATION

Step 1: Data collection

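Table II: Input Table

| stu_id | | | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|---|---|
| Name | | | vikas | guru | sayed | deepak | |
| Sub | Min | Max | M1 | M2 | M3 | M4 | M5 |
| Ca | 32 | 100 | 20 | 98 | 45 | 92 | |
| Bi | 32 | 100 | 23 | 98 | 69 | 83 | |
| Java | 32 | 100 | 24 | 97 | 67 | 74 | |
| Se | 32 | 100 | 25 | 96 | 89 | 92 | |
| Cf | 32 | 100 | 26 | 95 | 88 | 88 | |
| Db | 32 | 100 | 28 | 90 | 56 | 81 | |
| .. | | | | | | | |
| T | | | 624 | 1910 | 1416 | 1482 | |
| Com | | | 7 | 9 | 7 | 8 | |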

This is an extract of the student database with the fields or variables listed above. Marks scored by students in selected subjects over a period of three years, i.e., from June 2011 to April 2014, were collected from an institution in Bangalore.

Step 2: Data preprocessing

Preprocessing is done using the following statistical technique.

Chi-square test: applied to remove the variables that do not contribute to the result. From Table II above, Name, Max and Min were removed, giving Table III.
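
The paper does not spell out how the test was applied to each attribute. As a reminder of the statistic itself, the following minimal sketch computes the chi-square goodness-of-fit value, the sum of (observed - expected)^2 / expected over all categories. The function name and the example counts are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic for matching lists of counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of students in three mark bands, against a uniform expectation.
observed_counts = [18, 22, 20]
expected_counts = [20, 20, 20]
statistic = chi_square_statistic(observed_counts, expected_counts)
print(statistic)
```

For three categories the statistic has 2 degrees of freedom, and the usual 5% critical value is 5.991; a value far below it, as in this made-up example, gives no evidence against the expected distribution.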

Table III : Preprocessed table

The steps of the k-means stage are explained below:

Step 1: Clustering using the k-means algorithm.

Step 1: The preprocessed table is the input for k-means.

Step 2: Cluster the proficient student segment [PCS] and determine the exact number of clusters. The value of K is incremented in each step and the results are shown below.

The partition of PCS is done initially by taking k=2.

After applying k-means clustering with k=2, we have the following results.

Table IV: Partial view of clusters of students, for k=2

| Cluster | Student IDs |
|---|---|
| Cluster 1 | 1, 10, 11, 12, 13, 16, 18, 21, 22, 24, 26, 27, 29, 30 |
| Cluster 2 | 2, 3, 4, 5, 6, 7, 8, 9, 14, 15, 17, 19, 20, 23, 25, 28 |

The above table shows the grouping of students into two groups.

Table V: Difference between clusters for k=2

| Cluster | Cluster 1 | Cluster 2 |
|---|---|---|
| Cluster 1 | 0 | 0.229 |
| Cluster 2 | 0.229 | 0 |

For k = 2, the distances between the groups are tabulated above; the only nonzero distance, 0.229, is the minimum value.

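For k=3, applying k-means clustering, we have the following results.

Table VI: Partial view of three clusters, for k=3

| stu_id | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| Sub | M1 | M2 | M3 | M4 | M5 |
| ca | 20 | 98 | 45 | 92 | |
| bi | 23 | 98 | 69 | 83 | |
| java | 24 | 97 | 67 | 74 | |
| se | 25 | 96 | 89 | 92 | |
| cf | 26 | 95 | 88 | 88 | |
| db | 28 | 90 | 56 | 81 | |
| .. | | | | | |
| T | 627 | 1910 | 1416 | 1482 | |
| Com | 7 | 9 | 7 | 8 | |

| Cluster | Student IDs |
|---|---|
| Cluster 1 | 1, 10, 11, 12, 13, 16, 18, 21, 22, 26, 27, 29, 30 |
| Cluster 2 | 9, 24 |
| Cluster 3 | 2, 3, 4, 5, 6, 7, 8, 14, 15, 17, 19, 20, 23, 25, 28 |

The above table gives a partial view of the 3 clusters.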

Table VII: Differences between clusters

| Cluster | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Cluster 1 | 0 | 0.116 | 0.165 |
| Cluster 2 | 0.116 | 0 | 0.154 |
| Cluster 3 | 0.165 | 0.154 | 0 |

For k = 3, the distances between the groups are tabulated above; the minimum value is 0.116 (approximately 0.12).

For k=4, we have the following results.

Table VIII: Partial view of four clusters, for k=4

| Cluster | Student IDs |
|---|---|
| Cluster 1 | 1, 10, 11, 12, 13, 16, 18, 21, 22, 24, 26, 27, 29, 30 |
| Cluster 2 | 9 |
| Cluster 3 | 3, 4, 5, 6, 7, 8, 14, 15, 17, 19, 20, 23, 25, 28 |
| Cluster 4 | 2 |

Table IX: Comparison of distance between the clusters

| Cluster | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|---|
| Cluster 1 | 0 | 0.104 | 0.187 | 0.342 |
| Cluster 2 | 0.104 | 0 | 0.083 | 0.238 |
| Cluster 3 | 0.187 | 0.083 | 0 | 0.154 |
| Cluster 4 | 0.342 | 0.238 | 0.154 | 0 |

The comparison table above gives the distance between each pair of clusters. For example, the distance between cluster 2 and cluster 1 is 0.104537, given in row 1, column 3 (shown as 0.104). The other values are calculated similarly. These tables result from applying k-means while incrementing the value of k by 1 in every step.
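
A distance table of this kind can be produced from cluster centroids as sketched below. The paper does not state the distance measure or the centroid coordinates, so the Euclidean distance and the example centroids are assumptions, and both function names are hypothetical. The second part reads the smallest inter-cluster distance from the values reported in Table IX.

```python
import math


def centroid_distance_matrix(centroids):
    """Symmetric matrix of Euclidean distances between cluster centroids."""
    return [[math.dist(a, b) for b in centroids] for a in centroids]


def smallest_offdiagonal(matrix):
    """Smallest distance between two different clusters."""
    k = len(matrix)
    return min(matrix[i][j] for i in range(k) for j in range(i + 1, k))


# Hypothetical centroids, for illustration only (not taken from the paper).
example = centroid_distance_matrix([(0.50, 0.6), (0.60, 0.6), (0.80, 0.9)])

# Table IX (k=4) as reported in the paper.
table_ix = [[0, 0.104, 0.187, 0.342],
            [0.104, 0, 0.083, 0.238],
            [0.187, 0.083, 0, 0.154],
            [0.342, 0.238, 0.154, 0]]
print(smallest_offdiagonal(table_ix))  # 0.083
```

The diagonal is always zero and the matrix is symmetric, which is why only the entries above the diagonal need to be inspected.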

Table X: Cluster distance table

| Number of clusters | Shorter cluster distance |
|---|---|
| Cluster 2 | 0.2293 |
| Cluster 3 | 0.1658 |
| Cluster 4 | 0.3428 |
| Cluster 5 | 0.3133 |

The first value, 0.2293, in the shorter cluster distance field represents the distance between clusters 1 and 2; similarly, the second value, 0.1658, represents the distance between clusters 1 and 3. The other values in the table can be interpreted similarly.

From the above table it can be observed that the values in the shorter cluster distance field start increasing to a great extent, i.e., from 0.1658 to 0.3428, after cluster 3. Hence it can be concluded that the maximum number of clusters that can be formed is 3. So we choose k=3 and the 3rd cluster, because the centroid of the third cluster is nearest to the maximum marks of the subjects, i.e., 2000 (20 subjects).
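
One possible reading of this selection rule is to pick the k that immediately precedes the largest increase in the tabulated distance. The sketch below applies that reading to the Table X values; the function name `choose_k` is hypothetical, and the rule is an interpretation of the text rather than a procedure stated by the author.

```python
def choose_k(distance_by_k):
    """Return the k just before the largest increase in the tabulated distance."""
    ks = sorted(distance_by_k)
    # Increase observed when moving from each k to the next larger k.
    jumps = {
        ks[i]: distance_by_k[ks[i + 1]] - distance_by_k[ks[i]]
        for i in range(len(ks) - 1)
    }
    return max(jumps, key=jumps.get)


# Values from Table X, keyed by the number of clusters.
table_x = {2: 0.2293, 3: 0.1658, 4: 0.3428, 5: 0.3133}
print(choose_k(table_x))  # 3
```

The only increase in Table X occurs between 3 and 4 clusters (0.1658 to 0.3428), so this rule returns k=3, in agreement with the choice made in the text.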

Step 2: Choosing the cluster

When k=3, the 3rd cluster is chosen as the best cluster, as the centroid value of the third cluster is nearest to the maximum marks of the subjects, i.e., 2000 (20 subjects).
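
This choice amounts to picking the cluster whose centroid total lies closest to the maximum of 2000 marks. The sketch below shows that comparison; the centroid totals are made-up numbers, since the paper does not report the centroid values, and the function name is hypothetical.

```python
MAX_TOTAL = 2000  # 20 subjects x 100 marks, as stated in the text


def nearest_to_maximum(centroid_totals, max_total=MAX_TOTAL):
    """Return the cluster whose centroid total is closest to max_total."""
    return min(centroid_totals, key=lambda c: abs(max_total - centroid_totals[c]))


# Hypothetical centroid totals for k=3 (illustrative values only).
centroid_totals = {1: 900.0, 2: 1250.0, 3: 1600.0}
best_cluster = nearest_to_maximum(centroid_totals)
print(best_cluster)
```

With these illustrative totals the third cluster is selected because 1600 is the value closest to 2000.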

Step 3: Identifying the elements of the cluster.

Table XI: Elements of Cluster 3

| Cluster | Student IDs |
|---|---|
| Cluster 3 | 2, 3, 4, 5, 6, 7, 8, 14, 15, 17, 19, 20, 23, 25, 28 |

The table above represents the elements of the best cluster identified.

RESULTS:

From the above deduction, cluster 3 is found to be the best cluster, containing the proficient students listed below:

Table XII: Elements of Cluster 3

| Cluster | Student IDs |
|---|---|
| Cluster 3 | 2, 3, 4, 5, 6, 7, 8, 14, 15, 17, 19, 20, 23, 25, 28 |

CONCLUSION

The main objective was to identify proficient students by clustering using the k-means algorithm. Cluster 3, containing the proficient students, emerged as the best cluster among the students of the institution. The algorithm has an accuracy of 89%. Thus a solution to the stated problem was found.

REFERENCES

Kim, S. Y., Jung, T. S., Suh, E. H., & Hwang, H. S. (2006). Student segmentation and strategy development based on student lifetime value: A case study. Expert Systems with Applications, 31(1), 101–107.

Shin, H. W., & Sohn, S. Y. (2004). Product differentiation and market segmentation as alternative marketing strategies. Expert Systems with Applications, 27(1), 27–33.

Jang, S. C., Morrison, A. M. T., & O'Leary, J. T. (2002). Benefit segmentation of Japanese pleasure travelers to the USA and Canada: Selecting target markets based on the profitability and the risk of individual market segment. Tourism Management, 23(4), 367–378.

Hruschka, H., & Natter, M. (1999). Comparing performance of feed forward neural nets and k-means of cluster-based market segmentation. European Journal of Operational Research, 114(3), 346–353.

Leon Bottou, Yoshua Bengio, Convergence Properties of the K-Means Algorithms, Advances in Neural Information Processing Systems 7, 1995.

Vance Faber, Clustering and the Continuous k-Means Algorithm, Los Alamos Science, 1994.

Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining Association Rules between Sets of Items in Large Databases, ACM SIGMOD Record, 1993.

Archer-North and Associates, Performance Appraisal, http://www.performance-appraisal.com, 2006, Accessed Dec, 2012.

Chein, C., Chen, L., "Data mining to improve personnel selection and enhance human capital: A case study in high technology industry", Expert Systems with Applications, In Press (2006).

K. Pal, and S. Pal, Analysis and Mining of Educational Data for Predicting the Performance of Students, (IJECCE) International Journal of Electronics Communication and Computer Engineering, Vol. 4, Issue 5, pp. 1560-1565, ISSN: 2278-4209, 2013.

Z. N. Khan, Scholastic achievement of higher secondary students in science stream, Journal of Social Sciences, Vol. 1, No. 2, pp. 84-87, 2005.

B.K. Bharadwaj and S. Pal, Mining Educational Data to Analyze Students' Performance, International Journal of Advance Computer Science and Applications (IJACSA), Vol. 2, No. 6, pp. 63-69, 2011.

M. Bray, The shadow education system: private tutoring and its implications for planners, (2nd ed.), UNESCO, PARIS, France, 2007.

S. K. Yadav, B.K. Bharadwaj and S. Pal, Data Mining Applications: A comparative study for Predicting Students' Performance, International Journal of Innovative Technology and Creative Engineering (IJITCE), Vol. 1, No. 12, pp. 13-19, 2011.