(IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, o. 1, 2010
Application of k-Means Clustering algorithm for prediction of Students’ Academic Performance Oyelade, O. J Department of Computer and Information Sciences, College of Science and Technology, Covenant University, Ota, Nigeria.
[email protected].
Oladipupo, O. O Department of Computer and Information Sciences, College of Science and Technology, Covenant University, Ota, Nigeria.
[email protected].
With the help of data mining methods, such as clustering algorithm, it is possible to discover the key characteristics from the students’ performance and possibly use those characteristics for future prediction. There have been some promising results from applying k-means clustering algorithm with the Euclidean distance measure, where the distance is computed by finding the square of the distance between each scores, summing the squares and finding the square root of the sum [6].
Abstract— The ability to monitor the progress of students’ academic performance is a critical issue to the academic community of higher learning. A system for analyzing students’ results based on cluster analysis and uses standard statistical algorithms to arrange their scores data according to the level of their performance is described. In this paper, we also implemented k-mean clustering algorithm for analyzing students’ result data. The model was combined with the deterministic model to analyze the students’ results of a private Institution in %igeria which is a good benchmark to monitor the progression of academic performance of students in higher Institution for the purpose of making an effective decision by the academic planners.
This paper presents k-means clustering algorithm as a simple and efficient tool to monitor the progression of students’ performance in higher institution. Cluster analysis could be divided into hierarchical clustering and non-hierarchical clustering techniques. Examples of hierarchical techniques are single linkage, complete linkage, average linkage, median, and Ward. Non-hierarchical techniques include k-means, adaptive k-means, k-medoids, and fuzzy clustering. To determine which algorithm is good is a function of the type of data available and the particular purpose of analysis. In more objective way, the stability of clusters can be investigated in simulation studies [4]. The problem of selecting the “best” algorithm/parameter setting is a difficult one. A good clustering algorithm ideally should produce groups with distinct non-overlapping boundaries, although a perfect separation can not typically be achieved in practice. Figure of merit measures (indices) such as the silhouette width [4] or the homogeneity index [5] can be used to evaluate the quality of separation obtained using a clustering algorithm. The concept of stability of a clustering algorithm was considered in [3]. The idea behind this validation approach is that an algorithm should be rewarded for consistency. In this paper, we implemented traditional kmeans clustering algorithm [6] and Euclidean distance measure of similarity was chosen to be used in the analysis of the students’ scores.
Keywords- k – mean, clustering, academic performance, algorithm.
I.
Obagbuwa, I. C Department of Computer Science Lagos State University, Lagos, Nigeria.
[email protected]
INTRODUCTION
Graded Point Average (GPA) is a commonly used indicator of academic performance. Many Universities set a minimum GPA that should be maintained in order to continue in the degree program. In some University, the minimum GPA requirement set for the students is 1.5. Nonetheless, for any graduate program, a GPA of 3.0 and above is considered an indicator of good academic performance. Therefore, GPA still remains the most common factor used by the academic planners to evaluate progression in an academic environment [1]. Many factors could act as barriers to students attaining and maintaining a high GPA that reflects their overall academic performance during their tenure in University. These factors could be targeted by the faculty members in developing strategies to improve student learning and improve their academic performance by way of monitoring the progression of their performance. Therefore, performance evaluation is one of the bases to monitor the progression of student performance in higher Institution of learning. Base on this critical issue, grouping of students into different categories according to their performance has become a complicated task. With traditional grouping of students based on their average scores, it is difficult to obtain a comprehensive view of the state of the students’ performance and simultaneously discover important details from their time to time performance.
II.
METHODOLOGY
A. Development of k-mean clustering algorithm Given a dataset of n data points x1, x2, …, xn such that each data point is in Rd , the problem of finding the minimum
292
http://sites.google.com/site/ijcsis/ ISSN 1947-5500
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, o. 1, 2010
variance clustering of the dataset into k clusters is that of finding k points {mj} (j=1, 2, …, k) in Rd such that
1 2 3 4 5 6 7 8 9 10 11
is minimized, where d(xi, mj) denotes the Euclidean distance between xi and mj. The points {mj} (j=1, 2, …,k) are known as cluster centroids. The problem in Eq.(1) is to find k cluster centroids, such that the average squared Euclidean distance (mean squared error, MSE) between a data point and its nearest cluster centroid is minimized. The k-means algorithm provides an easy method to implement approximate solution to Eq.(1). The reasons for the popularity of k-means are ease and simplicity of implementation, scalability, speed of convergence and adaptability to sparse data.
12 13 14 15 16 17 18 19 20
The k-means algorithm can be thought of as a gradient descent procedure, which begins at starting cluster centroids, and iteratively updates these centroids to decrease the objective function in Eq.(1). The k-means always converge to a local minimum. The particular local minimum found depends on the starting cluster centroids. The problem of finding the global minimum is NP-complete. The k-means algorithm updates cluster centroids till local minimum is found. Fig.1 shows the generalized pseudocodes of k-means algorithm; and traditional k-means algorithm is presented in fig. 2 respectively.
MSE = largenumber; Select initial cluster centroids {mj}j K = 1; Do OldMSE = MSE; MSE1 = 0; For j = 1 to k mj = 0; nj = 0; endfor For i = 1 to n For j = 1 to k Compute squared Euclidean distance d2(xi, mj); endfor Find the closest centroid mj to xi; mj = mj + xi; nj = nj+1; MSE1=MSE1+ d2(xi, mj); endfor For j = 1 to k nj = max(nj, 1); mj = mj/nj; endfor MSE=MSE1; while (MSE