PAPER
Special Section on Knowledge-Based Software Engineering
Sizing Data-Intensive Systems from ER Model

Hee Beng Kuan TAN† and Yuan ZHAO†a), Nonmembers
SUMMARY There are still many problems in sizing software, despite the existence of well-known software sizing methods such as the Function Point method. Many developers continue to use ad-hoc or so-called "expert" approaches, mainly because the existing methods require much information that is difficult to identify or estimate in the early stage of a software project; the accuracy of ad-hoc and "expert" methods is also problematic. The entity-relationship (ER) model is widely used in conceptual modeling (requirements analysis) for data-intensive systems. The characteristics of a data-intensive system, and therefore of its source code, are in fact well captured by the ER diagram that models its data. This paper proposes a method for building software size models from extended ER diagrams through the use of regression models. We have collected real project data from the industry for a preliminary validation of the proposed method, and the result of the validation is very encouraging.
key words: software sizing, entity-relationship (ER) diagram, software estimation, multiple regression model
Manuscript received April 28, 2005.
Manuscript revised September 30, 2005.
† The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798.
a) E-mail: [email protected]
DOI: 10.1093/ietisy/e89–d.4.1321
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

1. Introduction

Project effort and cost estimation is crucial in the software industry. Overestimates may lead to the abortion of essential projects or the loss of projects to competitors. Underestimates may result in huge financial losses for software vendors and are also likely to affect project quality adversely. Proper effort and cost estimation for a software project is also vital to the success of project planning and management. The estimation of software size plays a key role in project effort and cost estimation. Lines of Code (LOC) [18] and Function Points (FP) are still the most commonly used size measures adopted by existing software cost estimation models.

Despite the existence of well-known software sizing methods such as the Function Point method [1], [11] and the more recent Full Function Point method [7], many practitioners and project managers continue to produce estimates based on ad-hoc or so-called "expert" approaches [2], [9], [16]. This is mainly because existing sizing methods require much implementation information that is not available in the early stage of a software project. However, the accuracy of ad-hoc and expert approaches is also problematic, which results in questionable project budgets and schedules.

The entity-relationship (ER) model originally proposed
by Chen [5] is generally regarded as the most widely used tool for the conceptual modeling of data-intensive systems. An ER model is constructed to depict the ideal organization of data, independent of the physical organization of the data and of where and how the data are used. Indeed, many requirements of data-intensive systems are reflected in the ER models that depict their data conceptually.

This paper proposes a novel method for building software size models to estimate the size of the source code of a data-intensive system based on its extended ER diagram. It also discusses our effort to validate the proposed method by building software size models for data-intensive systems written in the Visual Basic and Java languages. An earlier version of this paper was published in [19].

The paper is organized as follows. Section 2 gives the background of the paper. Section 3 discusses our observation and its rationale. Section 4 presents the proposed method for building software size models to estimate the sizes of source code for data-intensive systems. Section 5 discusses our preliminary validation of the proposed method. Section 6 concludes the paper and compares the proposed method with related methods.

2. Background

The entity-relationship (ER) model was originally proposed by Chen [5] for data modeling and has subsequently been extended by Chen and others [20]. In this paper, we refer to the extended ER model that has the same set of concepts as the class diagram in terms of data modeling. In summary, the extended ER model uses the concepts of entity, attribute and relationship to model the conceptual data for a problem. Each entity has a set of attributes, each of which is a property or characteristic of the entity that is of concern to the problem. Relationships can be classified into three types: association, aggregation and generalization.

There are four main stages in developing software systems: requirements capture, requirements analysis, design and implementation. The requirements are studied and specified in the requirements capture stage. They are realized conceptually in requirements analysis. The design for implementing the requirements, with the target environments taken into consideration, is constructed in the design stage. In the implementation stage, the design is coded in the target programming language and the resulting code is tested to ensure its correctness.
Though UML (Unified Modeling Language) has gained popularity as a standard software modeling language, many data-intensive systems in the industry are still developed through some form of data-oriented approach. In such an approach, some form of extended entity-relationship (ER) model is constructed to model the data conceptually in the requirements capture and analysis stages, and the subsequent design and implementation activities are very much based on this extended ER model. For projects that use UML, a class diagram is usually constructed in the requirements analysis stage. Indeed, for a data-intensive system, the class diagram constructed can be viewed as an extended ER model extended with behavioral properties (processing). Therefore, in the early stage of software development, some form of extended ER model is more readily available than information such as external inputs, outputs and inquiries, and internal logical files and external interface files, which are required for the computation of function points.

3. Our Observation

Data-intensive systems constitute one of the largest domains in software. These systems usually maintain large amounts of structured data in a database built using a database management system (DBMS), and provide operational, control and management support to end-users through referencing and analyzing these data. The support is usually accomplished by accepting inputs from users, processing these inputs, updating the database, printing reports, and providing inquiries to help users in management and decision making.

The proposed method for building software size models for data-intensive systems is based on the following characteristics of these systems. The constituents of a data-intensive system can be classified as follows:

1) Support of business operations through accepting inputs to maintain entities modeled in the ER diagram.
2) Support of decision-making processes through producing outputs from information possessed by entities modeled in the ER diagram.
3) Implementation of business logic to support business operation and control.
4) References to entities modeled in the ER diagram to support the first three constituents.

Since the first two and the last constituents are based on the ER diagram, they depend on the ER diagram. At first glance, it may seem that the third constituent does not depend on the ER diagram. However, a data-intensive system usually does not perform complex computation within its source code (any complex computation is usually achieved by calling pre-developed functions), so the business logic in the source code is mainly navigation between entities via relationship types, with simple computation. For example, consider the business logic that if a customer has two overdue invoices, then no further orders will be processed: the source code implementing it retrieves the overdue invoices in the Invoice entity type for the customer in the Customer entity type, via the relationship type that associates a customer with its invoices. No complex computation is involved.
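To make the flavor of such business logic concrete, here is a minimal sketch of the overdue-invoice rule (our own illustration, not code from the paper; the table and column names Invoice, customer_id, paid and due_date are hypothetical). The rule reduces to one navigation from Customer to Invoice plus a trivial count:

```python
import sqlite3

def order_allowed(conn: sqlite3.Connection, customer_id: int) -> bool:
    """Business rule: no further orders if a customer has two overdue invoices."""
    # Navigate from Customer to Invoice through the foreign key that
    # implements the customer-invoice relationship type, then count.
    (overdue,) = conn.execute(
        "SELECT COUNT(*) FROM Invoice "
        "WHERE customer_id = ? AND paid = 0 AND due_date < DATE('now')",
        (customer_id,),
    ).fetchone()
    return overdue < 2
```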
Therefore, it is reasonable to assume that, usually, the implementation of business logic in a data-intensive system also depends on the ER diagram. As such, from the above characteristics of data-intensive systems, under the same development environment (that is, a particular programming language and tool used), the size of the source code for a data-intensive system usually depends on the extended ER diagram that models its data.

4. The Proposed Software Sizing

From the observation discussed in the previous section, the size of the source code for a data-intensive system usually depends on the structure and size of the extended ER diagram that models its data. Furthermore, ER diagrams are widely and well used in the requirements modeling and analysis stages. It is thus suitable to base the estimation of the size of the source code for a data-intensive system on the extended ER diagram. We therefore propose a method for building software size models based on extended ER diagrams; this section discusses the method.

The proposed method builds software size models through well-known multiple linear regression, which has also been used by other researchers [13]. For a data-intensive system, the variables that characterize the extended ER diagram for the purpose of sizing its source code form the independent variables. The dependent variable is the size of its source code in thousand lines of code (KLOC), excluding comment and blank lines. Note that, in this case, the extended ER diagram is implemented by, and only by, the system; that is, the extended ER diagram and the system must coincide and have a one-to-one correspondence. As such, any source code that references or updates the database designed from the extended ER diagram must be included as part of the source code.

In the proposed approach, a separate software size model should be built for each development environment (that is, for each programming language and tool used). For example, different software size models should be built for systems written in Visual Basic with SQL and for systems written in Java with JSP and SQL. As ER diagrams constructed in most projects in the industry do not classify relationship types into associations, aggregations and generalizations, the independent variables that characterize the extended ER diagram comprise the following (a counting sketch is given after this list):

1) Total number of entity types.
2) Total number of attributes of all the entities.
3) Total number of relationships.

We propose that the independent variables should be defined according to the type of ER diagram constructed during the requirements modeling and analysis stages.
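The following minimal sketch (our own illustration; the toy classes and the example diagram are not from the paper) shows how the three independent variables could be extracted from a simple in-memory representation of an extended ER diagram:

```python
from dataclasses import dataclass, field

@dataclass
class EntityType:
    name: str
    attributes: list[str] = field(default_factory=list)

@dataclass
class ERDiagram:
    entity_types: list[EntityType]
    relationships: list[tuple[str, str]]  # pairs of related entity type names

def size_model_inputs(er: ERDiagram) -> tuple[int, int, int]:
    """Return the three independent variables (E, A, R) for the size model."""
    e = len(er.entity_types)                             # 1) entity types
    a = sum(len(t.attributes) for t in er.entity_types)  # 2) attributes
    r = len(er.relationships)                            # 3) relationships
    return e, a, r

er = ERDiagram(
    entity_types=[
        EntityType("Customer", ["id", "name", "credit_limit"]),
        EntityType("Invoice", ["id", "due_date", "amount", "paid"]),
    ],
    relationships=[("Customer", "Invoice")],
)
print(size_model_inputs(er))  # -> (2, 7, 1)
```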
At least, the data required for software sizing is then readily available in the early stage of requirements analysis. The steps for building the proposed software size models are as follows:

1) Independent variable identification: identify the independent variables based on the type of data model (a class or ER diagram) constructed during requirements modeling and analysis.
2) Data collection: collect the ER diagrams and the sizes of the source code (in KLOC) of sufficiently many data-intensive systems. Many free tools are available for the automated extraction of source code size (a toy sketch follows this list).
3) Model building and evaluation: there are quite a number of commonly used regression models [17]; our work considers the multiple linear regression model. The size of the source code (in KLOC) and the independent variables identified in the first step form the dependent and independent variables respectively of the model. A statistical package (e.g., SAS) should be used for the model building. Ideally, separate datasets should be used for model building and for evaluation; however, if the data is limited, the same dataset may be used for both.
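As a toy illustration of the size extraction in step 2 (our own sketch; real projects would use one of the free counting tools), the following function counts non-blank, non-comment lines. It recognizes only single-line comments with a configurable prefix; block comments would need a language-aware tool:

```python
from pathlib import Path

def kloc(root: str, suffix: str = ".java", comment_prefix: str = "//") -> float:
    """Count non-blank, non-comment source lines under root, in KLOC."""
    count = 0
    for path in Path(root).rglob(f"*{suffix}"):
        for line in path.read_text(errors="ignore").splitlines():
            stripped = line.strip()
            if stripped and not stripped.startswith(comment_prefix):
                count += 1
    return count / 1000.0
```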
Let n be the number of data points and k the number of independent variables. Let y_i and ŷ_i be the actual and estimated values respectively of a project, and let ȳ be the mean of all y_i. The goodness of a model can be evaluated by examining the following parameters:

• Magnitude of relative error, MRE, and mean magnitude of relative error, MMRE, defined as follows:

\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i} \quad (1)

\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MRE}_i \quad (2)
If the MMRE is small, then we have a good set of predictions. A usual criterion for accepting a model as good is MMRE ≤ 0.25 [8], [9], [13].

• Prediction at level l, Pred(l), where l is a percentage: it is defined as the number of cases whose estimates fall within the absolute limit l of the actual values, divided by the total number of cases. A standard criterion for considering a model acceptable is Pred(0.25) ≥ 0.75 [8], [9], [13].

• Multiple coefficient of determination, R^2, and adjusted multiple coefficient of determination, R_a^2: these are the usual measures in regression analysis, denoting the percentage of variance accounted for by the independent variables used in the regression equation. They are computed as follows:

R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \quad (3)

R_a^2 = 1 - \frac{n-1}{n-(k+1)} \cdot \frac{SSE}{SS_{yy}} \quad (4)

where SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 and SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2. In general, the larger the values of R^2 and R_a^2, the better the fit to the data. R^2 = 1 implies a perfect fit, with the model passing through every data point. However, R^2 can only be used to assess the usefulness of a model if the number of data points is substantially larger than the number of independent variables.

If the same dataset is used for both model building and evaluation, we can further examine the following parameters to evaluate model goodness:

• Relative root mean squared error, RMS, defined as follows [6]:

\mathrm{RMS} = \frac{\sqrt{SSE/(n-(k+1))}}{\bar{y}} \quad (5)
where SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. A model is considered acceptable if RMS ≤ 0.25 [6].

• Prediction sum of squares, PRESS [17]: PRESS measures how well the fitted values of a subset model can predict the observed responses y_i. The error sum of squares, SSE = \sum (y_i - \hat{y}_i)^2, is also such a measure. PRESS differs from SSE in that each fitted value for PRESS is obtained by deleting the i-th case from the dataset, estimating the regression function for the subset model from the remaining n − 1 cases, and then using the fitted regression function to obtain the predicted value \hat{y}_{i(i)} for the i-th case. That is,

\mathrm{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2 \quad (6)

Models with smaller PRESS values are considered good candidate models. The PRESS value is always larger than SSE because the fit used to predict each case excludes that case. A small PRESS value supports the validity of the model built.
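A minimal NumPy sketch of these evaluation measures (our own illustration, not code from the paper; the names y, y_hat, X and k are our own, holding the actual sizes, the estimates, the matrix of independent variables and the number of independent variables):

```python
import numpy as np

def evaluation_metrics(y, y_hat, k):
    """Model-goodness measures of Eqs. (1)-(5) for k independent variables."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    mre = np.abs(y - y_hat) / y                          # Eq. (1)
    mmre = mre.mean()                                    # Eq. (2)
    pred_25 = np.mean(mre <= 0.25)                       # Pred(0.25)
    sse = np.sum((y - y_hat) ** 2)
    ss_yy = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / ss_yy                                 # Eq. (3)
    r2_adj = 1 - (n - 1) / (n - (k + 1)) * sse / ss_yy   # Eq. (4)
    rms = np.sqrt(sse / (n - (k + 1))) / y.mean()        # Eq. (5)
    return {"MMRE": mmre, "Pred(0.25)": pred_25, "R2": r2,
            "R2a": r2_adj, "RMS": rms}

def press(X, y):
    """Eq. (6): each case is predicted by a model refitted without it."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                    # drop the i-th case
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return total
```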
5. Preliminary Validation

We spent much effort persuading organizations in the industry to supply us with their project data for validating the proposed software sizing method; as such, the whole validation took about one and a half years. This section discusses our validation. The independent variables used to characterize an ER diagram in our validation are as follows:
1) Number of entity types (E)
2) Number of attributes (A)
3) Number of relationship types (R)
Table 1. The VB based project dataset.
Table 2. Statistical tests for the VB model.
These variables provide a reasonable and concise characterization of the ER diagram. Our validation is based on the following linear regression model [15]:

\mathrm{Size} = \beta_0 + \beta_1 E + \beta_2 R + \beta_3 A \quad (7)

where Size is the total KLOC (thousand lines of code) of all the source code developed based on the ER diagram, and the β_i (0 ≤ i ≤ 3) are coefficients to be determined.
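As a concrete sketch of the model-building step (our own illustration; the project data below are invented, not taken from our datasets), Eq. (7) can be fitted by ordinary least squares:

```python
import numpy as np

# Invented training data: one row per project with the counts (E, R, A)
# taken from its ER diagram, and the measured size in KLOC.
X = np.array([
    [10, 12,  80],
    [18, 22, 150],
    [25, 30, 210],
    [14, 15, 120],
    [30, 38, 260],
    [21, 26, 170],
], dtype=float)
size_kloc = np.array([28.0, 49.0, 66.0, 37.0, 80.0, 55.0])

# Fit Eq. (7): Size = b0 + b1*E + b2*R + b3*A.
design = np.column_stack([np.ones(len(size_kloc)), X])
beta, *_ = np.linalg.lstsq(design, size_kloc, rcond=None)
b0, b1, b2, b3 = beta
print(f"Size = {b0:.3f} + {b1:.3f}E + {b2:.3f}R + {b3:.3f}A")
```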
5.1 The Dataset

We collected three datasets from multiple organizations in the industry, including software houses and end-user organizations such as public organizations and insurance companies. The projects cover a wide range of application domains, including freight management, administrative and financial systems. The first dataset comprises 13 projects developed using Visual Basic with SQL. The second dataset comprises 10 projects developed using Java with JSP and SQL. Tables 1 and 3 show the details of these two datasets, which are used for building the software size models for the respective development environments. The third dataset comprises 8 projects developed using the same Visual Basic development environment as the first dataset. Table 5 shows the details of the third dataset.
Table 3. The Java based project dataset.
5.2 The Resulting Models

From the Visual Basic based project dataset (Table 1), the resulting model that we built for estimating the size of source code (in KLOC) developed using Visual Basic with SQL is:

\mathrm{Size} = 14.737 + 1.221E + 0.548R - 0.035A \quad (8)
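As a usage illustration (the counts below are hypothetical and not taken from Table 1), applying Eq. (8) to an ER diagram with 20 entity types, 25 relationship types and 180 attributes:

```python
E, R, A = 20, 25, 180  # hypothetical counts extracted from an ER diagram
size = 14.737 + 1.221 * E + 0.548 * R - 0.035 * A
print(f"Estimated size: {size:.1f} KLOC")  # -> about 46.6 KLOC
```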
The adjusted multiple coefficient of determination R_a^2 for this model is 0.85, which is reckoned as good. To test the significance of the overall model and of its coefficients, an F-test and t-tests were carried out. In these significance tests, (Pr > F) = 0.0001, and (Pr > |t|) for the intercept, E, R and A is 0.0145, 0.0088, 0.2440 and 0.0820 respectively (shown in Table 2). The coefficients of R and A are thus not significant. To check for correlations between the independent variables, a correlation-of-estimates test was performed: r_ER, r_EA and r_RA are −0.61, −0.58 and −0.19 respectively, showing that E and R, and E and A, are correlated.

From the Java based project dataset (Table 3), the resulting model that we built for estimating the size of source code (in KLOC) developed using Java with JSP and SQL is:

\mathrm{Size} = 4.771 + 1.235E + 0.025R + 0.022A \quad (9)
The adjusted multiple coefficient of determination R_a^2 is 0.99 for this model, which is reckoned as very good. To test the significance of the overall model and of its coefficients, an F-test and t-tests were carried out. In these significance tests, (Pr > F) < 0.0001, and (Pr > |t|) for the intercept, E, R and A is 0.0003, < 0.0001, 0.1880 and 0.0017 respectively (shown in Table 4). The coefficient of R is thus not significant. A correlation-of-estimates test gives r_ER, r_EA and r_RA of −0.71, −0.91 and 0.41 respectively, again showing that E and R, and E and A, are correlated.

These results show that there are correlations between the independent variables in both the VB and the Java models, which may affect the significance of the regression coefficients.
Table 4. Statistical tests for the Java model.
Table 5. The VB based project evaluation.
Table 6. The evaluation result of Java.
In this project, most of the data in the datasets were supplied directly by users. For reasons of confidentiality, many industry organizations did not allow us to access their system documentation to verify the accuracy of the data they supplied. Furthermore, the sizes of both model-building datasets are not sufficient. These factors have severe impacts on the regression models. Thus, extending the datasets will be the main part of our future work, and we will collect data only from systems whose documentation and implementation we are allowed to verify. Model selection will also be performed to obtain better models and to reduce the correlations between independent variables.

5.3 Model Evaluation

For the first-order model that we built for estimating the size of source code (in KLOC) developed using Visual Basic with SQL, we managed to collect a separate dataset for the evaluation of the model. Note that R_a^2 for this model has already been computed during model building and is 0.85, which is reckoned as good. MMRE and Pred(0.25) computed from the evaluation dataset are 0.22 and 0.65 respectively; the value of Pred(0.25) is smaller than 0.75. The detailed results of the evaluation are shown together with the evaluation dataset in Table 5.

For the model that we built for estimating the size of source code (in KLOC) developed using Java with JSP and SQL, we did not manage to collect a separate dataset for the evaluation, so we used the same dataset for both building and evaluation. Note that R_a^2 for this model has already been computed during model building and is 0.99, which is reckoned as very good. MMRE, Pred(0.25), SSE and PRESS computed from the same dataset are 0.03, 1.00, 11.23 and 95.65 respectively.
The detailed results of the evaluation are shown in Table 6. Both MMRE and Pred(0.25) fall well within the acceptable levels. Although there is a difference between SSE and PRESS, the difference is not overly substantial either. Note that RMS computed from SSE is 0.03 in this case; if SSE is replaced by PRESS in the computation of RMS, the value becomes 0.08. Both values fall well below the acceptance threshold of 0.25. Therefore, the evaluation results support the validity of the model built.

6. Comparative Discussion

We have proposed a method for building software size models for data-intensive systems. We do not claim that the models built in this paper are ready for use; however, we believe that our work shows enough promise for the proposed sizing method to be studied further. Software size estimation is a key input to project estimation, which in turn is vital for project control and management [3], [4], [12]. Existing software size estimation methods have many problems. As totally new datasets are required for building and evaluating software size models with the proposed method, we call for collaboration between industry and the research community to validate the proposed method further and more comprehensively. The history of establishing the Function Point method suggests that, without such an effort, building a usable software size model is unlikely to succeed.

As discussed in [16], most existing software sizing methods [10], [13], [14], [21] require much implementation information that is not available, and is difficult to predict, in the early stage of a software project. This information is not even available after the requirements analysis stage; it only becomes available in the design or implementation stage. For example, the Function Point method is based on external inputs, outputs and inquiries, and internal logical files and external interface files; such implementation details are not available even at the end of the requirements analysis stage. Object Points (OP) can be used to estimate software size during the planning stages of Application Composition projects [4]. It is an adaptation of the Function Point method used in COCOMO II.
It is determined by counting the screens, reports and 3GL modules developed in the application. However, such objects are not directly related to "objects" in the object-oriented (OO) methodology; the Class Point (CP) was therefore proposed to generalize the Function Point method for OO systems [8]. However, the CP method is based on information from design documentation.

The ER diagram has been well used in conceptual modeling for developing data-intensive systems. Some proposals for software projects have also included ER diagrams as part of the project requirements. As such, ER diagrams are practically available at least after the requirements analysis stage. Once the ER diagram is constructed, the proposed software size model can be applied without much difficulty. Therefore, in the worst case, we can apply the proposed approach after the requirements analysis stage. Ideally, a brief extended ER model should be constructed during the project proposal or planning stage; the proposed software size model can then be applied to estimate the software size as an input for project effort estimation. Subsequently, when a more accurate extended ER model is available, the model can be reapplied for more accurate project estimation. A final revision of the project estimate should be carried out at the end of the requirements analysis stage, by which an accurate extended ER diagram should be available.

The well-known Function Point method is also mainly for data-intensive systems. As such, the domain of application of the proposed sizing method is similar to that of the Function Point method.

Acknowledgments

We would like to thank IPACS E-Solution (S) Pte Ltd, Singapore Computer Systems Pte Ltd, NatSteel Ltd, Great Eastern Life Assurance Co. Limited, JTC Corporation and National Computer Systems Pte Ltd for providing the project data. Without their support, this work would not have been possible.

References

[1] A.J. Albrecht and J.E. Gaffney, Jr., "Software function, source lines of code, and development effort prediction: A software science validation," IEEE Trans. Softw. Eng., vol.SE-9, no.6, pp.639–648, Nov. 1983.
[2] P. Armour, "Ten unmyths of project estimation: Reconsidering some commonly accepted project management practices," Commun. ACM, vol.45, no.11, pp.15–18, Nov. 2002.
[3] B.W. Boehm and R.E. Fairley, "Software estimation perspectives," IEEE Softw., vol.17, pp.22–26, Nov./Dec. 2000.
[4] B.W. Boehm, E. Horowitz, R. Madachy, D. Reifer, B.K. Clark, B. Steece, A.W. Brown, S. Chulani, and C. Abts, Software Cost Estimation with COCOMO II, Prentice Hall, 2000.
[5] P.P. Chen, "The entity-relationship model—Towards a unified view of data," ACM Trans. Database Syst., vol.1, no.1, pp.9–36, March 1976.
[6] S.D. Conte, H.E. Dunsmore, and V.Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, 1986.
[7] COSMIC Full Function Points, Release 2.0, Sept. 1999.
[8] G. Costagliola, F. Ferrucci, G. Tortora, and G. Vitiello, "Class point: An approach for the size estimation of object-oriented systems," IEEE Trans. Softw. Eng., vol.31, no.1, pp.52–74, Jan. 2005.
[9] J.J. Dolado, "A validation of the component-based method for software size estimation," IEEE Trans. Softw. Eng., vol.26, no.10, pp.1006–1021, Oct. 2000.
[10] D.V. Ferens, "Software size estimation techniques," Proc. IEEE NAECON 1988, pp.701–705, 1988.
[11] D. Garmus and D. Herron, Function Point Analysis: Measurement Practices for Successful Software Projects, Addison Wesley, 2000.
[12] C.F. Kemerer, "An empirical validation of software project cost estimation models," Commun. ACM, vol.30, no.5, pp.416–429, May 1987.
[13] R. Lai and S.J. Huang, "A model for estimating the size of a formal communication protocol application and its implementation," IEEE Trans. Softw. Eng., vol.29, no.1, pp.46–62, Jan. 2003.
[14] L.A. Laranjeira, "Software size estimation of object-oriented systems," IEEE Trans. Softw. Eng., vol.16, no.5, pp.510–522, May 1990.
[15] J.T. McClave and T. Sincich, Statistics, 9th ed., Prentice Hall, 2003.
[16] E. Miranda, "An evaluation of the paired comparisons method for software sizing," Proc. Int. Conf. on Software Eng., pp.597–604, 2000.
[17] J. Neter, M.H. Kutner, C.J. Nachtsheim, and W. Wasserman, Applied Linear Regression Models, Irwin, 1996.
[18] R. Park, Software Size Measurement: A Framework for Counting Source Statements (CMU/SEI-92-TR-20, ADA258304), Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Sept. 1992. http://www.sei.cmu.edu/pub/documents/92.reports/pdf/tr20.92.pdf
[19] H.B.K. Tan and Y. Zhao, "ER-based software sizing for data-intensive systems," Proc. Int. Conf. on Conceptual Modeling, pp.180–190, 2004.
[20] T.J. Teorey, D. Yang, and J.P. Fry, "A logical design methodology for relational databases using the extended entity-relationship model," ACM Comput. Surv., vol.18, no.2, pp.197–222, June 1986.
[21] J. Verner and G. Tate, "A software size model," IEEE Trans. Softw. Eng., vol.18, no.4, pp.265–278, April 1992.
Hee Beng Kuan Tan received his B.Sc. (1st Hons) in Mathematics in 1974 from Nanyang University (Singapore). He received his M.Sc. and Ph.D. degrees in Computer Science from the National University of Singapore in 1989 and 1996 respectively. He is currently an Associate Professor with the Information Communication Institute of Singapore (ICIS) in the School of Electrical and Electronic Engineering, Nanyang Technological University. Before moving to academia, he had 13 years of experience in software systems design, development and project management. His current research interests include transformation from functional models to implementation models, software verification and testing, and software estimation.

Yuan Zhao received her B.Sc. in applications of computer science in 1996 from Xidian University (China). After five years of software development and systems design work in industry, she is currently a Ph.D. student at Nanyang Technological University. Her research interests include software estimation, software verification and data mining.