International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 5, May 2012)
Application of K-mean Algorithm in Software Maintenance
1 Rasmita Dash, Assistant Professor, Department of School of Computer Science & Engg., ITER, Siksha ‘O’ Anusandhan University, Bhubaneswar-30, INDIA.
2 Rajashree Dash, Department of School of Computer Science & Engg., ITER, Siksha ‘O’ Anusandhan University, Bhubaneswar-30, INDIA.
1 [email protected] 2 [email protected]
Abstract— Software maintenance is one of the most important and time-consuming parts of the software development life cycle, because it covers the correction of errors; the enhancement, deletion, and addition of capabilities; adaptation to changing data requirements and operational environments; and the improvement of performance, usability, and other quality-related attributes. Once a modification has been implemented, the software system has to be retested to gain confidence that it will perform according to the specification. For this, a test suite again needs to be designed to test both the modified module and the entire software. The time spent in maintenance can therefore be reduced by reducing the number of test cases. We propose a methodology based on clustering by which the test suite can be significantly reduced.
II. BACKGROUND
1.1 Software Maintenance
Software maintenance [1,7] is "the process of modifying a software system or component after delivery to correct faults, improve performance or other attributes, or adapt to a changed environment." This definition reflects the common view that software maintenance is a post-delivery activity: it starts when a system is released to the customer or user and encompasses all activities that keep the system operational and meeting the user's needs. This view is well summarized by the classical waterfall models of the software life cycle [13], which generally include a final phase of operation and maintenance, as shown in Fig. 1.
Keywords— Software maintenance, software testing, test case, K-mean algorithm
Fig. 1: Waterfall Model
I. INTRODUCTION
Maintenance plays an important role in the life cycle of a software product. It is estimated that there are more than 100 billion lines of code in production in the world, and as much as 80% of it is unstructured, patched, and poorly documented. Maintenance can alleviate these problems. Our approach deals in particular with one cost of maintenance: retesting. Automatically generated test suites contain redundant test cases, and executing them increases the time taken by the maintenance phase. As the number of test cases grows, this time becomes significant. We use a data mining approach [15] to deal with this issue. Data mining [15] is a relatively new research area that aims to extract patterns from data that are not otherwise visible. Within data mining we use a clustering technique [15,4], which groups similar data. Applying this technique to our test suite reduces its size significantly, and the reduced test suite yielded good results for both path and condition coverage [16].
It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Maintenance consists of four types. Corrective maintenance deals with fixing bugs in the code. Adaptive maintenance deals with adapting the software to new environments. Perfective maintenance deals with updating the software according to changes in user requirements. Finally, preventive maintenance deals with updating documentation and making the software more maintainable. All changes to the system can be characterized by these four types of maintenance. Corrective maintenance is 'traditional maintenance', while the other types are considered 'software evolution'.
In the maintenance process, errors are first identified, corrections are made through the addition and deletion of capabilities, and the modifications are then implemented in the existing software [8,9]. Testing follows to confirm that the software meets its specification [12]. For this, test cases need to be designed to test both the modified parts and the entire software. It is very tedious to explore every possible test case manually [3,7]. Automated test case generation [2], in which test cases are built automatically, addresses this: a simple program can generate thousands of test cases quickly. The problem with this approach is that if the software consists of thousands of lines of code, each test case takes a long time to execute, and if the test suite is large, executing it may take days. A test suite [5,3] is a collection of automatically generated test cases for a particular piece of software. Because the test cases are machine generated, the suite also contains redundant test cases; redundancy is the repetition of data between one test case and another. To improve the maintenance work, optimization of the test cases is therefore a must.
Optimizing the test suite is therefore important, because it saves the time that would otherwise be spent executing redundant or unnecessary test cases. The behavioral patterns exhibited by the test suite help in automating this process. A test case is a tuple of three attributes [I, S, O], i.e. the input I and its corresponding output O for a particular system S. A good test case is one that is able to find faults in the software.
Clustering [4] is a data mining technique that automatically forms meaningful or useful clusters of objects that have similar characteristics. The similarity within a cluster can be measured by different approaches, such as distance-based, density-based, and grid-based ones. All points within a cluster are found to have similar behavior.
1.3 K-mean algorithm
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem [4]. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated. As a result of this loop, the k centroids change their location step by step until no more changes occur. The K-mean algorithm therefore iterates the following three steps until convergence, i.e. until stable (no object changes group):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimal distance.
III. PROPOSED METHODOLOGY
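The K-means loop described in Section 1.3 (define k centroids, assign each point to its nearest centroid, recompute the centroids, and repeat until no point changes group) can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this paper: the function name `k_means`, the `max_iter` safeguard, the use of Euclidean distance, and the sample points are all assumptions made for the example.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    """Cluster points with the basic K-means loop (illustrative sketch).

    Returns (centroids, labels), where labels[i] is the index of the
    cluster that points[i] was assigned to.
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 3: measure the distance of each point to every
        # centroid and group the point with the nearest one.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stable: no object moved to another group, so stop iterating.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 1 (for the next round): recompute each centroid as the
        # mean (barycenter) of the points currently assigned to it.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups and two starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = k_means(points, initial_centroids=[(0, 0), (10, 10)])
# labels -> [0, 0, 0, 1, 1, 1]
```

The starting centroids are passed in explicitly because, as noted above, different initial locations can lead to different results.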
1.2 Data Mining
Data mining is the extraction of hidden predictive information from large databases [16]. Data mining software is one of a number of analytical tools for analyzing data.
By applying a data mining technique to software maintenance, we propose a method to reduce the size of the test suite.
1.4 Proposed Algorithm
Step-1: Generate test cases depending on the number of inputs and outputs for the modified component or for the entire software.
Step-2: Determine the number of linearly independent paths for the software under test (using path coverage and McCabe's cyclomatic complexity).
Step-3: Apply the K-means algorithm to the generated test cases (here K signifies the number of clusters, i.e. the number of linearly independent paths in the dependency graph).
Step-4: Perform testing taking a single test case from each cluster.
Step-5: Repeat from Step-3 until acceptable coverage is achieved.
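Step-2 and Step-4 above can be illustrated with a short sketch. It assumes McCabe's standard formula V(G) = E - N + 2P for a control-flow graph with E edges, N nodes, and P connected components. The control-flow graph, the function names, and the rule of keeping the first test case seen in each cluster are hypothetical choices for illustration, since the algorithm only requires a single test case per cluster. The test cases and cluster numbers are the first six rows of Table I.

```python
def cyclomatic_complexity(edges, num_nodes, num_components=1):
    """McCabe's cyclomatic complexity V(G) = E - N + 2P of a control-flow graph."""
    return len(edges) - num_nodes + 2 * num_components


def pick_representatives(test_cases, labels):
    """Step-4 sketch: keep the first test case encountered in each cluster."""
    representatives = {}
    for case, label in zip(test_cases, labels):
        representatives.setdefault(label, case)
    return [representatives[label] for label in sorted(representatives)]


# Hypothetical control-flow graph of a routine with two sequential
# if/else decisions: 7 nodes (0..6) and 8 directed edges.
cfg_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 6), (5, 6)]
k = cyclomatic_complexity(cfg_edges, num_nodes=7)  # 8 - 7 + 2 = 3

# Cluster numbers as Step-3 might assign them to six test cases.
test_cases = [(2, 10), (2, 5), (8, 4), (5, 8), (1, 2), (7, 5)]
labels = [1, 2, 3, 1, 2, 3]
reduced_suite = pick_representatives(test_cases, labels)
# reduced_suite -> [(2, 10), (2, 5), (8, 4)]
```

Here the value of k obtained in Step-2 is the number of clusters that Step-3 would request from K-means, and the reduced suite holds one test case per cluster.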
1.5 Sample test cases
Suppose we have software that takes only two inputs, x and y, each within a given range, and that the program imposes different conditions on the values of x and y. An automated test case generator [2] randomly generates different values for x and y. Suppose it generates 1000 test cases and that, for path coverage testing, the software has 30 linearly independent paths. In other words, the software exhibits 30 different behaviors, which may correspond to 30 different conditional statements. Optimization can then be done as follows. Rather than testing with 1000 test cases, we can minimize the number of test cases used for maintenance. According to our methodology, the 1000 test cases are grouped into K clusters using the K-mean algorithm. As the software has 30 paths, the total number of clusters is 30. One test case picked from each of the 30 clusters will exercise a different behavior. Testing the software with 1000 test cases is therefore equivalent to testing it with 30 test cases, one from each cluster. With this approach we can reduce the maintenance time by eliminating testing with unnecessary test cases.

TABLE I

Test case   Distance from Centroid1   Distance from Centroid2   Distance from Centroid3   Cluster
(2,10)      2.6                       7.25                      10.83                     1
(2,5)       4.4                       2.25                      5.8                       2
(8,4)       9.4                       5.75                      1.16                      3
(5,8)       2.4                       6.75                      5.83                      1
(7,5)       7.4                       5.75                      0.83                      3
(6,4)       7.4                       3.75                      2.16                      3
(1,2)       8.4                       3.25                      9.16                      2
(4,9)       1.6                       6.75                      7.83                      1
(3,8)       0.4                       4.75                      7.83                      1
(4,5)       4.4                       2.75                      3.83                      2
(4,4)       7.4                       2.75                      6.16                      2
(9,8)       6.4                       10.75                     4.83                      3
(1,7)       3.4                       5.25                      7.4                       1
(8,2)       11.4                      6.75                      5.75                      3
(7,5)       7.4                       3.16                      0.83                      3
[Plot-1: Clustering Test cases. Scatter plot of the test cases; both axes range from 0 to 12.]
Consider a situation in which a segment of code has 3 independent paths and 15 different test cases have been designed for it. To exercise the 3 paths we could execute all the test cases separately, which would be a time-consuming process. According to our methodology, that is, using the K-mean data clustering approach, we divide the test cases into 3 clusters (as there are 3 different paths), as shown in Table-1 and Plot-1.
Testing with the 15 different test cases is now the same as testing with only one test case from each cluster shown in Table-2.

TABLE II

Cluster-1   Cluster-2   Cluster-3
(2,10)      (2,5)       (8,4)
(5,8)       (1,2)       (7,5)
(4,9)       (4,5)       (6,4)
(3,8)       (4,4)       (9,8)
(1,7)                   (8,2)
                        (7,5)

By this approach we reduce the time wasted in maintenance due to unnecessary testing. This example is limited to only 2 variables; when a program has more than 10 inputs, the approach reduces the effort even further. The running time of testing is now the sum of the times for generating the test cases, generating the clusters, finding the number of clusters, and testing the software with the new test suite. This running time was found to be far less than that of executing all the generated test cases. The optimization outperforms normal testing when the program to be tested is very large.

IV. CONCLUSION
In this paper we discussed different problems related to software maintenance and how they can be solved. We discussed the automated generation of test cases and the problems that may arise with this method. To improve the maintenance work, we proposed an optimized test case generation technique using a data mining approach called the K-mean clustering technique. Using this technique we have minimized the number of redundant test cases. The methodology was further tested on a test suite that contains many variables. For a program containing several conditional checks, coverage of the conditions was tested for different values of k. As the value of k increases, the number of test cases to be executed increases, and coverage also improves.

REFERENCES
[1] Chapin N., Hale J.E., Khan K. Md., Ramil J.F., and Tan W., 2001, "Types of Software evolution and software maintenance", Journal of Software Maintenance and Evolution: Research and Practice, 13, pp. 3-30.
[2] Ranjan Aritha, 2006, "Automated Requirement Based Test case Generation", Communication of ACM.
[3] Harrold M.J., Gupta R. and Soffa M.L., 1993, "A Methodology for Controlling the Size of Test suite", ACM Trans. on Software Eng. and Meth., 2(3), pp. 270-285.
[4] Kanungo Tapas, Mount David M., 2002, "A Local Search Approximation algorithm for K-mean Clustering", Communication of ACM.
[5] Chen Yanping & Probert Robert L., 2007, "Regression Test suite reduction using Extended dependence analysis", Communication of ACM.
[6] Marre Martina and Bertolino Antonia, "Using spanning sets for coverage testing", IEEE Transactions on Software Engineering, vol. 29.
[7] Bertolino Antonia, 2007, "Software testing Research: Achievements, challenges and dreams", Future of Software Engineering.
[8] Alkhatib, G., 1992, "The Maintenance Problem of Application Software: An Empirical Analysis", Journal of Software Maintenance – Research and Practice, 4(2):83-104.
[9] Arnold, R. S., Bohner, S. A., 1993, "Impact Analysis – Toward a Framework for Comparison", Proceedings of the Conference on Software Maintenance, Montreal, Canada, IEEE Computer Society Press, Los Alamitos, CA, pp. 292-301.
[10] Artur, L. J., 1998, "Software Evolution: The Software Maintenance Challenge", John Wiley & Sons, New York, NY.
[11] Bennett, K. H., Rajlich, V., 2000, "The Staged Model of the Software Lifetime: A New Perspective on Software Maintenance", IEEE Computer, to appear.
[12] IEEE Std. 1219-1998, 1998, "Standard for Software Maintenance", IEEE Computer Society Press, Los Alamitos, CA.
[13] ISO/IEC 12207, "Information Technology – Software Life Cycle Processes", Geneva, Switzerland, 1995.
[14] Lientz, B. P., Swanson, B. E., 1980, "Software Maintenance Management", Addison-Wesley, Reading, MA.
[15] Ramesh Lilly, 2009, "Knowledge Mining of Test Case System", International Journal on Computer Science and Engineering, 2(1), pp. 69-73.
[16] Mark Last and Menahem Friedman, 2003, "The Data Mining approach to automated software testing", Communications of ACM.