Automatic Generation of Package Diagram to Understand Java Packages Li Jiang, Xiaobing Sun, Yun Li, Xiangyue Liu School of Information Engineering, Yangzhou University, Yangzhou, China {xbsun, liyun}@yzu.edu.cn, {875110457, 495296335}@qq.com
Abstract
Program comprehension is a prerequisite for most software maintenance and evolution tasks. Given an unfamiliar system, it is difficult for practitioners to determine which software artifacts are relevant to the current task. A Java software system generally contains a variety of packages, which often have different intents and different relationships with each other. The information in each package, together with the relationships between packages of different stereotypes, forms a signature of the system. This paper proposes a novel approach that automatically generates a description of each package and a diagram showing the relationships between the packages. The generated description and diagram allow developers to understand the main intent and structure of the system more easily.
I. Introduction
Program comprehension is a prerequisite for most software maintenance and evolution tasks. Given an unfamiliar system, it is difficult for practitioners to determine which software artifacts are relevant to the current task. Studies show that developers usually spend much more effort reading and comprehending source code than editing it [19], [24], [11], [21], [23]. Sometimes developers only need a quick understanding of part of the code to decide whether a more detailed comprehension effort is necessary. If the target system has clear and up-to-date documentation, developers can understand the source code more quickly and efficiently identify which parts of the software are of interest or related to the maintenance task. Unfortunately, in most cases good comments and documents are missing or outdated, so developers must spend a great deal of time reading the code to gain even a general understanding of the system.
978-1-4799-4860-4/14/$31.00 copyright 2014 IEEE ICIS 2014, June 4-6, 2014, Taiyuan, China
One approach to overcome this problem is to automatically generate descriptive comments directly from the source code. While this has been applied successfully to Java methods and classes [22], [15], generating comments for more complex code artifacts, e.g., packages, is significantly more difficult. A large-scale Java software system contains many packages. A Java package is an important component for grouping related types, and it provides access protection and namespace management. In addition, a package is a basic unit that usually indicates some function of the software, and a basic unit through which developers gain a quick understanding of part of the code. Hence, for a Java system, a quick and accurate comprehension of its packages helps in understanding the whole system. A package usually contains several classes. To understand a package, we need to analyze and count the distribution of its classes over the various stereotypes. Class stereotypes are categories that represent the intent of classes in the software design [15]. Without such a summary, developers still cannot quickly understand a package and locate the essential part of the source code. When the documentation is missing or outdated, it is also inconvenient for developers to open each class file and read it. For example, suppose a developer is assigned a small functional module to implement. No module is completely independent of the rest of the system: when developers need some function or data, they may call methods in other parts of the system, yet they may not be familiar with every module and component. Likewise, modifying existing code is unavoidable during software maintenance. However, due to the lack of documents and comments, developers are unable to gain even a superficial understanding.
Our approach to this problem is to generate a readable description and draw a high-level, abstract relationship diagram for packages, the assemblies of classes in Object-Oriented (OO) programming languages such as Java. JStereoCode is a tool for identifying class stereotypes in Java code [16], [22]. It focuses on the design intent, the main goal of a class, and each of its methods. Our focus is not only on the elements in a Java package, but also on the relationships between different packages in the entire system. We propose a novel approach to automatically generate a table with natural language descriptions and a diagram for Java packages. We assume the documentation is missing (i.e., if it exists, it is currently not analyzed). The input is a Java project; for each package in it, our approach outputs a table describing the package. For the whole system, our approach analyzes the package tables and uses a diagram to reflect the relationships between packages. Hence, understanding the packages in a system involves two parts:
• The description of the package's elements. We conjecture that the classes in a package are similar in stereotype and distribution. These classes indicate a design intent, which reflects the main function and purpose of the package. In this part, the description is presented in a table. The stereotype of the package and some of its classes constitute the label that describes the package.
• The diagram reflecting the relationships between packages. First, we filter the packages by the labels generated in the part above. Then, we study the relationships between classes by considering the number and actions of a package's classes that connect with classes in external packages. In this part, a diagram is generated to represent these relationships clearly.
The rest of the paper is organized as follows. Section II introduces our approach: how we identify each package's label, generate the table describing the essential information based on an existing class stereotype taxonomy, and produce a diagram showing the relationships between packages. Section III discusses related work. Finally, Section IV concludes and outlines future work.
II. Our Approach
Although the automatic generation of summaries for Java methods and classes has been proposed [22], [15], these techniques cannot be directly used to understand the packages in a system. A package is more complex than a class, because its content is much more varied. Just as the types of the methods in a class reflect the intent of the class, the types of the classes in a package reflect the intent of the package, so we must study the classes in a package and understand their types. We summarize the types of classes and packages as a rule for classifying Java packages, as shown in Table 1.
A package is defined as a container of classes. Packages divide classes into different namespaces to help developers manage them. A package often contains classes, interfaces, components, etc. When we group classes into a package, we usually follow three empirical rules [13]: 1) Classes with an inheritance relationship should be included in the same package. 2) Classes with a combination relationship should be included in the same package. 3) Classes with more collaborations should be included in the same package. So each package can be considered a relatively independent part of the whole system. The following subsections discuss how to generate the diagram for a package in three steps: label identification, class and content selection, and diagram generation.
A. Label Identification
The label is an abstract description of the content of a package, and we later use it to filter that content. The label has two parts: the main label and the stereotype of classes. To label a package, the stereotypes of its classes must be taken into account. Here, we adapt the class stereotype identification technique proposed in [6], [5] to facilitate package labeling. Before identifying the stereotype of a class, we need to analyze the stereotypes of its methods. Class stereotypes and method stereotypes have been defined in previous work. There are 15 stereotypes for methods and 13 stereotypes for classes [1], [5], as shown in Table 1. When identifying the label of a package, we do not treat internal packages specially; the same operation can be used to identify their labels. Specifically, the package label is identified from the number of classes of each stereotype. For example, suppose the data provider stereotype accounts for the largest proportion of classes. We look it up in Table 1 and find that data provider belongs to the Entity package label. We then combine the package label, Entity, with the class stereotype, data provider, to obtain the label Entity data provider. Once the class stereotype is identified, the class description is generated using the approach proposed by G. Sridhara [15]. Finally, the package labels are obtained as shown in Table 1.

TABLE I. Package label classification
Package label | Stereotype of classes
Entity | Entity, Minimal Entity, Data Provider, Commander
Boundary | Boundary
Controller | Controller, Pure Controller

However, the above operation considers only the classes in a package and cannot identify a package that contains interfaces. So we further need to identify component packages, which include interfaces. First, we search the source code in the package and give a stamp to each interface. A stamp is a variable that records the number of classes implementing an interface: whenever we find a class that implements the interface, we increase the stamp of that interface. In the end, we count up the interface stamps in the package and identify the package label based on the following condition:

$$\frac{|implements|}{|classes|} > \theta$$

In the formula, |implements| represents the number of classes implementing the interface, and |classes| represents the total number of classes in the package, excluding interfaces and other elements. θ is set by the user to adjust the accuracy for different systems, so it needs to be estimated in advance for each system. The label above cannot express the type of the package. Thus we add an extra label showing the type of the package, which makes the label more exact, as shown in Fig. 1 and Table 2. Fig. 1 gives an overall description of the package; we generate the textual description based on the package label and class label.
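The labeling rule and the interface condition described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `PackageLabeler`, `ClassStereotype`, `mainLabel`, `label` and `isComponentPackage` are hypothetical, the grouping of stereotypes under Entity, Boundary and Controller follows our reading of Table I, and ties between equally frequent stereotypes are broken by declaration order, which the text does not specify.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PackageLabeler {

    /** Class stereotypes that appear in Table I. */
    enum ClassStereotype {
        ENTITY, MINIMAL_ENTITY, DATA_PROVIDER, COMMANDER,
        BOUNDARY, CONTROLLER, PURE_CONTROLLER
    }

    /** Maps a class stereotype to its main package label (assumed grouping). */
    static String mainLabel(ClassStereotype stereotype) {
        switch (stereotype) {
            case BOUNDARY:
                return "Boundary";
            case CONTROLLER:
            case PURE_CONTROLLER:
                return "Controller";
            default:
                return "Entity";
        }
    }

    /** Builds the two-part label from the most frequent class stereotype. */
    static String label(List<ClassStereotype> classStereotypes) {
        if (classStereotypes.isEmpty()) {
            throw new IllegalArgumentException("package has no classified classes");
        }
        Map<ClassStereotype, Integer> counts = new EnumMap<>(ClassStereotype.class);
        for (ClassStereotype s : classStereotypes) {
            counts.merge(s, 1, Integer::sum);
        }
        ClassStereotype dominant = null;
        int best = 0;
        // EnumMap iterates in declaration order, so ties keep the earlier constant.
        for (Map.Entry<ClassStereotype, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                dominant = e.getKey();
                best = e.getValue();
            }
        }
        String stereotypeText = dominant.name().toLowerCase(Locale.ROOT).replace('_', ' ');
        return mainLabel(dominant) + " " + stereotypeText;
    }

    /** Checks the condition |implements| / |classes| > theta. */
    static boolean isComponentPackage(int implementingClasses, int totalClasses, double theta) {
        return totalClasses > 0 && (double) implementingClasses / totalClasses > theta;
    }

    public static void main(String[] args) {
        List<ClassStereotype> stereotypes = List.of(
                ClassStereotype.DATA_PROVIDER,
                ClassStereotype.DATA_PROVIDER,
                ClassStereotype.BOUNDARY);
        System.out.println(label(stereotypes)); // prints "Entity data provider"
        System.out.println(isComponentPackage(3, 5, 0.5));
    }
}
```

The threshold value 0.5 in `main` is arbitrary; the paper leaves θ to the user.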
B. Classes and Contents Selection
The label above only fixes the package's type. Developers still cannot acquire enough information about the system or the package from it. Therefore, we need to select specific parts of the program to display the content of a package. Once the label is determined, we identify which classes should be considered to support package comprehension. These classes are selected according to the package label. For example, suppose the package label is Entity entity and its type is data provider. We focus on the entity classes and data provider classes, and then choose the important classes from among them. The selection process has two steps. (1) We choose classes by their method counts, where the counted methods must be related to the class stereotype. We keep the number of selected classes below 30%. (2) We extract the important content of the classes selected in Step (1). As the methods in the classes have already been identified, we choose methods according to the class's stereotype. The number of selected methods is limited to a percentage in the range 20%-30%. Finally, we scan the signatures of the methods and organize the words to generate text describing the methods, as shown in Fig. 2 and Table 2. The content is generated from a template we have defined. The generated text and label are displayed in the package diagram and the package information table. For example, after analyzing our example com.example.Products, we obtain the classes and contents selected from it and build the information table to give the developer a more detailed understanding, as shown in Table 2.
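A minimal Java sketch of the two selection steps follows. It rests on several assumptions that the paper does not state: classes (or methods) are ranked by the number of stereotype-related methods, at least one item is always kept, ties are broken alphabetically, and the description template simply collects attribute names from `get`/`set` method names. The names `ContentSelector`, `selectTop` and `describeAccessors` are hypothetical, and the authors' actual template is not given in the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ContentSelector {

    /**
     * Returns the names with the highest counts, keeping at most maxPercent
     * of all entries (but at least one when the map is not empty).
     */
    static List<String> selectTop(Map<String, Integer> relevantMethodCounts, int maxPercent) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(relevantMethodCounts.entrySet());
        // Highest count first; equal counts are ordered by name for determinism.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int limit = Math.min(entries.size(),
                Math.max(1, entries.size() * maxPercent / 100));
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, limit)) {
            selected.add(e.getKey());
        }
        return selected;
    }

    /** Builds a Table II style sentence from getter and setter names. */
    static String describeAccessors(List<String> methodNames) {
        Set<String> attributes = new LinkedHashSet<>();
        for (String name : methodNames) {
            boolean accessor = name.startsWith("get") || name.startsWith("set");
            if (accessor && name.length() > 3) {
                attributes.add(name.substring(3));
            }
        }
        if (attributes.isEmpty()) {
            return "";
        }
        return "Mutate and access attribute " + String.join(", ", attributes) + ".";
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of(
                "ProductInfo", 6, "Brand", 5, "Helper", 1, "Constants", 0);
        System.out.println(selectTop(counts, 50));
        System.out.println(describeAccessors(List.of(
                "getName", "setName", "getStyle", "setStyle",
                "getBaseprice", "setBaseprice")));
    }
}
```

The same `selectTop` helper could be reused for Step (2) by passing method names and a percentage between 20 and 30.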
C. Diagram Generation
The above two sections focus on the elements within a package. This section classifies the packages by their relations. There are five kinds of relationships between classes in Object-Oriented programs [13]: inheritance, association, aggregation, dependency and generalization. In object-oriented programming, inheritance is a way to establish an is-a relationship between classes or objects. Association defines a relationship between classes of objects that allows one object instance to cause another to perform an action on its behalf. Aggregation is a special case of combination. However, when we group classes into packages, we usually follow the three rules introduced above, so classes with inheritance and combination relationships should be included in the same package. Since such classes are already grouped together, we do not consider inheritance and aggregation relationships between classes in the same package. Therefore, we divide the packages into three types by the relationships between their classes. The first is dependency, which reflects the relationship between independent model elements and non-independent model elements. From this relationship, developers can know which independent package is used by a non-independent package. The second is the use relationship, which reflects that a package may use some elements of another package. The last is generalization, which reflects inheritance between classes in different packages. Sometimes one package has a generalization relationship with several packages. Before studying the classes in a package, we filter and sort the packages in the following three steps.
1) Filter the classes by their access level. We ignore the private and protected levels, because classes at these two levels cannot be accessed by external packages. In addition, this operation reduces the number of elements in the package, which improves efficiency.
2) Sort the packages by their labels. Once we confirm the relationship between two packages, we exclude them from our collection to avoid loops. We need to ensure that the relationship between two packages is unidirectional.
3) According to the sorting results of Step 2), we establish a table for each package to record the counts of the different types. We start from Boundary (the sequence is based on the complexity of the packages; we start from the most complex one).
After filtering and sorting, we analyze the relationships. First, our approach scans the classes one by one and records which packages are used and which methods are called. Take Table 3 as an example: package A uses four classes in package B, and there are six classes in package B. In addition, package A uses the largest number of elements in package B. So we can establish the 'use' relationship between them. We then generate the relation diagram from this relationship. The result is that package X uses package A, as shown in Fig. 3. However, this table can only identify the dependency relationship. For generalization, we use the same operation and count the number of classes that extend package X's classes. In addition, we do not count API packages and classes, because they are not very useful for understanding the system [12], [20]. Finally, we combine the extracted information to generate the diagram for the software system, as shown in Fig. 4 and Table 4.

Fig. 1. Label identification of a package
[Figure content] Package name: com.example.Products. Label: Entity_entity. Text: Package com.example.Products is classified as Entity. It consists mostly of entity class.

Fig. 2. Classes and contents selection of a package
[Figure content] Package name: com.example.Products. Label: Entity_entity. Classes: ProductInfo, Brand. Summarization.

TABLE II. Software historical repositories
Package Name | Label | Selected classes | Content description
com.example.Products | Entity entity | ProductInfo | Mutate and access attribute Name, Style, Baseprice.
 | | Brand | Mutate and access attribute Code, Name, Logo. Be used in class ProductInfo.

Fig. 3. Package X uses the package A
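The analysis of the 'use' relationship can be illustrated with a short Java sketch. It assumes one possible reading of the procedure: for each external package, the proportion of its classes used by the package under analysis is computed, and the 'use' relationship is drawn to the package with the highest proportion. The names `UseRelationFinder`, `isApiPackage` and `mostUsedPackage` are hypothetical, and treating packages whose names start with `java.` or `javax.` as API packages is an assumption made only for this example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class UseRelationFinder {

    /** Assumed filter for API packages, which are not counted. */
    static boolean isApiPackage(String packageName) {
        return packageName.startsWith("java.") || packageName.startsWith("javax.");
    }

    /**
     * Returns the external package with the highest proportion of used classes.
     *
     * @param usedClasses  external package name -> classes of it that are used
     * @param totalClasses external package name -> total number of its classes
     */
    static Optional<String> mostUsedPackage(Map<String, Set<String>> usedClasses,
                                            Map<String, Integer> totalClasses) {
        String best = null;
        double bestRatio = 0.0;
        for (Map.Entry<String, Set<String>> e : usedClasses.entrySet()) {
            String pkg = e.getKey();
            Integer total = totalClasses.get(pkg);
            if (isApiPackage(pkg) || total == null || total <= 0) {
                continue;
            }
            double ratio = (double) e.getValue().size() / total;
            // On equal ratios the alphabetically smaller name wins (arbitrary choice).
            boolean tieWin = best != null && ratio == bestRatio && pkg.compareTo(best) < 0;
            if (ratio > bestRatio || tieWin) {
                best = pkg;
                bestRatio = ratio;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Three of five classes used in A, three of thirteen used in B.
        Map<String, Set<String>> used = Map.of(
                "A", Set.of("a", "b", "c"),
                "B", Set.of("d", "e", "f"),
                "java.util", Set.of("List"));
        Map<String, Integer> totals = Map.of("A", 5, "B", 13, "java.util", 1);
        mostUsedPackage(used, totals)
                .ifPresent(target -> System.out.println("X uses " + target));
    }
}
```

With the counts used in `main`, package A is used at 60% and package B at about 23%, so the sketch reports that X uses A.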
TABLE III. An example
Package Name | Classes used | Use count | Proportion | Total classes
A | Class a | 2 | 60% | 5
 | Class b | 1 | |
 | Class c | 1 | |
B | Class d | 3 | 23% | 13
 | Class e | 1 | |
 | Class f | 2 | |

Fig. 4. Diagram generation of the packages

TABLE IV. Label and content generation of the packages
Package Name | Label | Selected Classes | Content Description
F | Entity entity | ProductInfo | Mutate and access attribute Name, Style, Baseprice.
 | | Brand | Mutate and access attribute Code, Name, Logo. Be used in class ProductInfo.
A | Boundary boundary | ProductService | Save brand entity.
 | | StyleService | Save visible productstyle

III. Related Work
Program comprehension is a prerequisite for software maintenance and evolution. If the comments and documents of a system are missing or outdated, it becomes more difficult for developers to understand unfamiliar code, and they must spend a great deal of time reading and comprehending it to gain a general understanding of the system. A number of studies have investigated this problem [2], [18], [7], [4], [3], and program summarization is currently an effective approach to program comprehension. Summarization of natural language documents has been widely investigated in text retrieval, natural language understanding, and cognitive psychology [10]. Recently, these techniques have been applied to source code summarization [8], [9], [14], [17]. Such summarization often focuses on the distribution of the elements in the source code, so we believe that the distribution of classes of different stereotypes in a package can determine the intent and content of the whole package. Dragan et al. used method stereotype distribution as a signature descriptor for software systems [5]. Sridhara et al. performed studies on automatically generating comments for methods and classes in Object-Oriented programs [22], [15]. Their approaches focused on a single source code file and did not consider the dependencies between program elements. In this paper, we use the distribution of program elements to describe the whole software system and generate descriptions for its packages. In addition, we take the dependencies between program elements into account to understand a package. Our approach not only generates natural language summaries describing the elements in a Java package, but also generates a diagram for the whole software system showing the dependencies between program elements.
IV. Conclusion and Future Work
This paper proposed a novel approach to understanding packages: it generates a description of each package and a package diagram showing the relationships between packages, giving a general understanding of the whole system. The generated description and diagram allow developers to understand the main intent and structure of the system. In the future, we plan to conduct empirical studies on a wider range of open-source Java systems, and we will invite developers to evaluate our system. In addition, the results of identifying a package's relationships with external packages are still not very clear. Moreover, unidentified packages account for a proportion of all packages. So we will further develop techniques to solve these problems for wider and more effective application of our approach.
Acknowledgment
The authors would like to thank the anonymous reviewers, whose comments made the paper more understandable and stronger. This work is supported partially by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant No. 13KJB520027, partially by the Innovative Fund for Industry-Academia-Research Cooperation of Jiangsu Province under Grant No. BY201306310, and partially by the Cultivating Fund for Science and Technology Innovation of Yangzhou University under Grant No. 2013CXJ025.
References
[1] N. Alhindawi, N. Dragan, M. L. Collard, and J. I. Maletic. Improving feature location by enhancing source code with stereotypes. In ICSM, pages 300–309, 2013.
[2] T. J. Biggerstaff, B. G. Mitbander, and D. Webster. The concept assignment problem in program understanding. In Proceedings of the 15th international conference on Software Engineering, pages 482–498, 1993.
[3] D. Binkley, D. Lawrie, E. Hill, J. Burge, I. Harris, R. Hebig, O. Keszocze, K. Reed, and J. Slankas. Task-driven software summarization. In 2013 IEEE International Conference on Software Maintenance, pages 432–435, 2013.
[4] X. Dong and M. W. Godfrey. Identifying architectural change patterns in object-oriented systems. In The 16th IEEE International Conference on Program Comprehension, ICPC 2008, pages 33–42, 2008.
[5] N. Dragan, M. L. Collard, and J. I. Maletic. Using method stereotype distribution as a signature descriptor for software systems. In 25th IEEE International Conference on Software Maintenance (ICSM 2009), pages 567–570, 2009.
[6] N. Dragan, M. L. Collard, and J. I. Maletic. Automatic identification of class stereotypes. In 26th IEEE International Conference on Software Maintenance (ICSM 2010), pages 1–10, 2010.
[7] A. Guzzi, L. Hattori, M. Lanza, M. Pinzger, and A. van Deursen. Collective code bookmarks for program comprehension. In The 19th IEEE International Conference on Program Comprehension, pages 101–110, 2011.
[8] S. Haiduc, J. Aponte, and A. Marcus. Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, pages 223–226, 2010.
[9] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus. On the use of automated text summarization techniques for summarizing source code. In WCRE, pages 35–44, 2010.
[10] K. S. Jones. Automatic summarising: The state of the art. Inf. Process. Manage., 43(6):1449–1481, 2007.
[11] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Trans. Software Eng., 32(12):971–987, 2006.
[12] W. Maalej and M. P. Robillard. Patterns of knowledge in API reference documentation. IEEE Trans. Software Eng., 39(9):1264–1282, 2013.
[13] S. MicroSystems. Annotations (The Java Tutorials > Learning the Java Language), 2008.
[14] L. Moreno and J. Aponte. On the analysis of human and automatic summaries of source code. CLEI Electron. J., 15(2), 2012.
[15] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. L. Pollock, and K. Vijay-Shanker. Automatic generation of natural language summaries for Java classes. In IEEE 21st International Conference on Program Comprehension, ICPC 2013, pages 23–32, 2013.
[16] L. Moreno and A. Marcus. JStereoCode: automatically identifying method and class stereotypes in Java code. In ASE, pages 358–361, 2012.
[17] L. Moreno, A. Marcus, L. L. Pollock, and K. Vijay-Shanker. JSummarizer: An automatic generator of natural language summaries for Java classes. In ICPC, pages 230–232, 2013.
[18] L. L. Pollock, K. Vijay-Shanker, D. Shepherd, E. Hill, Z. P. Fry, and K. Maloor. Introducing natural language program analysis. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 15–16, 2007.
[19] P. C. Rigby and M. P. Robillard. Discovering essential code elements in informal documentation. In 35th International Conference on Software Engineering, pages 832–841, 2013.
[20] M. P. Robillard, E. Bodden, D. Kawrykow, M. Mezini, and T. Ratchford. Automated API property inference techniques. IEEE Trans. Software Eng., 39(5):613–637, 2013.
[21] M. P. Robillard, W. Coelho, and G. C. Murphy. How effective developers investigate source code: An exploratory study. IEEE Trans. Software Eng., 30(12):889–903, 2004.
[22] G. Sridhara, E. Hill, D. Muppaneni, L. L. Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, pages 43–52, 2010.
[23] J. Starke, C. Luce, and J. Sillito. Searching and skimming: An exploratory study. In 25th IEEE International Conference on Software Maintenance (ICSM 2009), pages 157–166, 2009.
[24] A. T. T. Ying and M. P. Robillard. Code fragment summarization. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 655–658, 2013.