To appear in Software Quality Journal, 11, 39-55, 2003
Software quality evaluation based on expert judgement

Tony Rosqvist, Mika Koskela, Hannu Harju
VTT Industrial Systems, 02044 VTT, Espoo, Finland

Keywords: software quality, software measurement, expert judgement
Summary

A method using expert judgement for the evaluation of software quality is presented. The underlying principle of the approach is the encoding of experts' tacit knowledge into probabilistic measures associated with the achievement level of software quality attributes. An aggregated quality measure is obtained based on preference statements related to the quality attributes. The technical objectives of the paper are

• to develop a generic and operationally feasible measurement technique to transform the tacit knowledge of a software expert into a probability distribution depicting his/her uncertainty of the level of achievement related to a quality attribute;
• to develop rules for the construction of a consensus probability measure based on expert-specific probability measures;
• to derive a framework for specifying a software quality strategy and for evaluating the acceptance of a software product produced in a software development process.

The above technical developments are used to support group decision-making regarding

• the launch or implementation decision of a software version;
• the allocation of resources during the software development process.
1. Introduction

Software quality is an increasingly important topic as software technology is applied frequently in diverse technical systems. Software measurement processes supporting the evaluation of software quality are therefore an increasingly integral part of any software development process (Baumert and McWhinney 1992, Järvinen 2000, Basili and Rombach 1988, Rumbaugh et al. 1998, Fowler and Scott 1997). The motivation for software measurement is to understand, control and improve the various quality attributes of the software and the underlying development process. A prerequisite is the definition of the quality attributes and the related measures / metrics (Fenton and Pfleeger 1997, Fenton and Neil 1999, Kilpi 2001).
Previous research indicates that software quality measurement is a difficult task. For instance, it has been shown (Littlewood and Strigini 1993, Butler and Finelli 1993) that empirical reliability growth models (Keiller and Miller 1991) are unrealistic for proving compliance with ultra-reliability (< 10⁻⁷ failures per hour) requirements of software. In addition, the pooling of experience data from different applications is questionable, as the operational profile (Musa 1993) varies over applications.
The use of Commercial-Off-The-Shelf (COTS) software products as system components has shown a substantial increase during the last few years. Therefore there is also a growing interest in estimating system reliability from the reliability of its constituent components. Within a system, one component may call another component, which may not sufficiently satisfy the system requirements. The use of COTS software components may thus violate the component independence assumption of basic reliability models (Mason and Woit 1998). Furthermore, the increasing pace of COTS software product development limits the
portability of quality-related COTS data between COTS product generations. In other words, it may be difficult to acquire representative data sets for the parameter estimation of reliability models of software with embedded COTS components.
In short, the evaluation of software quality is made difficult by the uniqueness of the software product in the development and the use environment.
Perspectives on software quality are presented by Garvin (1984): a product view, a manufacturing view, a user view, a value-for-money view, and a transcendental view. For example, the manufacturing view focuses on product quality during production and after delivery, and advocates conformance to process rather than to specification. The perspective taken influences how quality is defined, and how software measurement should be addressed.
In this paper we do not adopt any specific perspective on software quality. A framework for software measurement and evaluation, based on expert judgement, is introduced that supports the software developer and/or assessor in arranging the quality control of the software development process and the software product. We will look at software measurement as an integral part of a periodic V&V activity that tracks the engineering of quality in a software development process.
In section 2 the hierarchical representation of software quality is reviewed and the building blocks of a quality strategy are defined. In section 3, a subjective achievement level metric is defined, as well as construction rules for a consensus achievement level metric. Acceptance rules are also proposed and discussed. In section 4, measurement processes are briefly discussed from the point of view of utilising expert judgement. In section 5, an example is worked out. Scientific principles guiding the use of expert judgement are briefly reviewed in section 6. Section 7 finishes the paper with conclusions.
2. Software quality strategy

NOTATION
q_Fj        Level of quality factor Fj, j = {1,…,J}
q_F         Level of software quality (related to a three-level software quality hierarchy)
w_Fj        Weight of quality factor Fj
x_k         Level of quality attribute k, k = {1,…,K}
w_k         Weight of quality attribute k
X_k^i       Subjective Achievement Level of expert i related to attribute k, random variable
pdf(x_k^i)  Probability density function related to X_k^i
X_k         Consensus Achievement Level related to attribute k, random variable
pdf(x_k)    Probability density function related to X_k
r_F, r_Fj   Risk levels adopted for the software quality and the quality factors, respectively
z           Application-specific boundary value
2.1 Quality factor hierarchy, attributes and metrics

Software quality can conceptually be decomposed into quality factors and quality attributes. The three-level Quality Factor Hierarchy (McCall 1994) in Fig. 1 illustrates the structuring of quality factors into quality attributes¹, which in turn are related to various measures or metrics. For analytic clarity, the attributes should be non-overlapping with respect to the quality factors, but in practice this is almost impossible to achieve for quality objectives stated in broad terms. Therefore, some attributes will be common to several quality factors in the hierarchy (see Fig. 1).

¹ The word 'attribute' is used instead of McCall's original 'criteria' to avoid confusion with acceptance criteria and to emphasise the generality of the hierarchic representation of software quality.

The measures or metrics are typically related to physical observations. In the case of deterministic quality attributes the measure is, in principle, a binary indicator representing a yes/no judgement to the question 'Is the requirement associated with the attribute satisfied?', for instance, 'Is a fail-safe design principle followed?'. In practice, there may be different interpretations related to the degree to which the principle has been followed, rendering the yes/no measurement too coarse.
In the case of probabilistic attributes, physical observations are used for statistical inference of dependability attributes such as reliability. A possible problem associated with statistical inference is the lack of representative data. This may motivate the use of expert judgement as ‘observations’ in the software measurement.
To address the challenges described above, we will introduce a generic metric that represents the subjective perception of a software assessor regarding the achievement level of both deterministic and probabilistic quality attributes in section 3.1.
It is important to note that all the quality attributes are properties of which more is better. Achieving more of them will, however, not happen without some investment costs. Thus we may expect the quality attributes to compete for the available resources, but they are not inherently in conflict or contradictory.
Several research papers exist where quality factors and their attributes are defined (McCall 1994, ISO9126 1991, Tervonen 1996). These definitions of quality factors are good references and starting points for defining a software quality model that lies at the core of any software quality strategy.
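Purely as an illustration of the hierarchy described above (not part of the method itself), a quality model can be represented as a nested structure; the factor and attribute names and weights in the sketch below are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class QualityAttribute:
    name: str
    weight: float                 # relative importance w_k within its factor

@dataclass
class QualityFactor:
    name: str
    weight: float                 # relative importance w_Fj within the quality model
    attributes: list = field(default_factory=list)

# Hypothetical two-factor model; names and weights are illustrative only.
quality_model = [
    QualityFactor("Reliability", 0.6, [QualityAttribute("fault tolerance", 0.5),
                                       QualityAttribute("recoverability", 0.5)]),
    QualityFactor("Usability", 0.4, [QualityAttribute("learnability", 1.0)]),
]
```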
2.2 Software quality strategy

Denote the software quality level of a quality factor F by q_F. The level q_F will obviously depend on the achievement levels x_k of the related quality attributes k ∈ {1,…,K}. The simplest way of defining the quality level in quantitative terms is the weighted arithmetic average

    q_F(x_1, …, x_K; w_1, …, w_K) = Σ_{k=1}^{K} w_k x_k        (1)

where the weights w_k depict the relative importance of the attributes k. An approach to derive the weights by pair-wise comparison of the quality attributes is outlined in Appendix A. Eq. (1) also depicts a re-scaling of direct measurements of the x_k's to an indirect quality level metric q_F for the quality factor F.
We can generalise the single-factor software quality model given by Eq. (1) to a multi-factor software quality model given by Eq. (2):

    q_F(q_F1, …, q_FJ; w_F1, …, w_FJ) = Σ_{j=1}^{J} w_Fj q_Fj        (2)

where the weights w_Fj denote the relative importance of the quality factors Fj, and q_F denotes the software quality level.
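A minimal sketch of Eqs. (1) and (2) for crisp (non-random) achievement levels follows; the numbers are hypothetical and the functions assume normalised weights.

```python
def quality_factor_level(x, w):
    """Eq. (1): weighted arithmetic average of attribute achievement levels x_k."""
    assert len(x) == len(w) and abs(sum(w) - 1.0) < 1e-9   # weights assumed normalised
    return sum(wk * xk for wk, xk in zip(w, x))

def software_quality_level(qF, wF):
    """Eq. (2): weighted arithmetic average of quality factor levels q_Fj."""
    return quality_factor_level(qF, wF)

# Hypothetical example with two factors:
q1 = quality_factor_level([0.9, 1.0], [0.5, 0.5])     # 0.95
q2 = quality_factor_level([1.1], [1.0])               # 1.10
print(software_quality_level([q1, q2], [0.6, 0.4]))   # 1.01
```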
Associate with each quality factor Fj a risk level r_Fj, and with the overall quality factor F a risk level r_F. A software quality strategy may now be formulated by specifying a) a single-factor or a multi-factor software quality model (including the weights), b) risk levels which are used as acceptance criteria, and c) acceptance rules, according to which the decision-maker(s) can evaluate whether the software is acceptable, or not.
In section 3.2, two acceptance rules are defined that take into consideration the uncertainty in the experts' judgements related to the achievement levels of the quality attributes in the quality model. The acceptance rules combine the quality measurement and the acceptance criteria in a probabilistic statement, the truth value of which is the outcome of the software quality evaluation.
3. The Subjective Achievement Level method

3.1 Expert judgement related to quality attributes
The basic motivation for the development of the Subjective Achievement Level method is the argument that the assessment of the compliance of a software product with stated requirements is, in many cases, subjective and dependent on the assessor's past experience. In other words, each expert has his/her own mental model of the quality of a software version based on the artifacts obtained during the software development process and the experience to interpret these. This suggests that the evaluation of software quality may be based on expert group decision-making rather than empirical evidence alone.
According to Kelly (1955), the formation of a mental model is a result of a cognitive search process (search of knowledge primitives, signs) where the characteristics of an object are mapped within construct systems. A construct system has a finite number of constructs, each defined by a positive and a negative pole.
The authors argue that a software expert (developer/assessor) is capable of expressing his opinion on the achievement level of a quality attribute based on the mental model of the software. In our case, the search process can be ignited and directed according to the goal oriented metric approach GQM (Basili and Rombach 1988). Furthermore, the achievement level can be viewed as a random variable, and the uncertainty related to it expressed by a probability distribution function.
Specifically, we will introduce the definition of the metric Subjective Achievement Level depicting the perception of a software developer or evaluator regarding the compliance of software quality with given quality requirements:
Definition. Subjective Achievement Level (SAL). The metric Subjective Achievement Level is a direct and subjective measure of an expert's perception of the maturity of software to satisfy given requirements, and is defined on the range [0, z] of real numbers, where z is an arbitrary real number such that z > 1. Specific values are interpreted such that '0' means 'no achievement at all', '1' means 'full achievement' and 'z' means 'the requirement is exceeded by (z-1)*100%'. Intermediate values depict intermediate achievement levels. A particular SAL measurement related to expert i and quality attribute k is denoted by X_k^i (X_k^i ∈ [0, z] for all i, k), and is obtained by expert elicitation. Rather than expressing a single value, the expert expresses the uncertainty associated with the value by a probability density function (pdf), pdf(x_k^i). The measurement scale is a ratio scale.
Uncertainty is expressed using the Triangular distribution, which can be defined on the range [0, z] by three parameters, as shown in Fig. 2. Thus, denote the probability density function of the SAL of expert i related to quality attribute k by pdf(x_k^i) ~ Triang(a_k^i, m_k^i, b_k^i), where a_k^i ≤ m_k^i ≤ b_k^i ∈ [0, z]. The reasons for using the Triangular distribution are twofold: firstly, it is definable exactly for an interval I ⊆ [0, z]; secondly, the mode of the pdf is given by one of the parameters defining the pdf, rendering the specification of the pdf easy, even without graphical support.
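As a sketch, an expert's SAL elicited as Triang(a, m, b) on [0, z] can be represented and sampled as follows; the parameter values are hypothetical, and numpy's triangular sampler merely stands in for whatever elicitation and simulation tool is used.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_sal(a, m, b, z=1.2, n=10_000):
    """Sample an expert's SAL pdf ~ Triang(a, m, b), with 0 <= a <= m <= b <= z."""
    assert 0.0 <= a <= m <= b <= z
    return rng.triangular(a, m, b, size=n)

# Hypothetical judgement of expert i on attribute k: most likely 0.9, between 0.85 and 0.95.
x_ki = sample_sal(0.85, 0.90, 0.95)
print(round(x_ki.mean(), 3))   # close to (0.85 + 0.90 + 0.95) / 3 = 0.9
```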
The generic situation of resolving uncertainty as the number of software development cycles increases is depicted in Fig. 2. For an ideal software development process, as the number of cycles increases, the pdf(x_k^i) shrinks to a peak at the value one, depicting the perception of expert i that the requirement related to quality attribute k is satisfied, i.e. Pr{x_k^i = 1} = 1.
3.2 Expert consensus and aggregation of judgements

In the case where several experts elicit their SALs for a quality attribute, we would like to aggregate the experts' SALs in a way that reflects the level of consensus with respect to the maturity of the particular quality attribute to satisfy given requirements. The following definition is introduced:
Definition. Consensus Achievement Level (CAL). The metric Consensus Achievement Level is an indirect measure obtained by applying a construction rule associated with a certain consensus level agreed upon by the experts. A particular CAL related to attribute k is denoted by Xk. Four consensus levels are described in Table 1 depicting four different construction rules for the pdf of the CAL of a quality attribute.
The starting point for the construction would be the experts’ initial SALs related to a quality attribute. After a review of these in an expert group meeting, the level of consensus, according to Table 1, would be discussed and agreed.
The consensus level I would entail a shared perception that the requirement related to a quality attribute has been reached. This would be interpreted as attaching a crisp value of one to the corresponding CAL and the descriptive label ‘adequate’ to the consensus level.
If such a shared perception cannot be reached due to uncertainty, experts discuss whether mathematical aggregation based on arithmetic averaging would best represent consensus in the group. In our nomenclature, this would correspond to consensus level II with the label 'averaging'. Technical details of mathematical aggregation of probability distribution functions are found in Cooke (1991).
Experts may be reluctant to accept an aggregation based on averaging if judgements are very disparate, and may rather agree on keeping the extremist SAL values found in the group. We would then define the parameters of the Triangular pdf related to the CAL X_k according to a_k = min_i{a_k^i}, m_k = min_i{m_k^i} and b_k = max_i{b_k^i}. Here we take the most pessimistic mode found in the pdfs of the SALs elicited from the experts. This would correspond to consensus level III with the label 'extremist'.
If the uncertainties are significant, it is natural to expect that some experts are not confident with the achievement level for some quality attribute at all and will not engage in a SAL measurement. This situation would correspond to consensus level IV representing the right to ‘veto’.
The argument that uncertainty decreases, when changing from consensus level III to level II (see Table 1), is based on an information theoretic result, reviewed in Appendix B, stating that the decision-maker should adopt a pdf obtained by arithmetic averaging of the experts' pdf's to minimise the sum of surprises when his pdf is replaced by the experts' pdf's.
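As an illustration only (not a prescribed implementation), the level I and level III construction rules can be sketched as follows for triangular SALs; the level II 'averaging' rule corresponds to sampling from the mixture of the experts' pdfs, sketched separately in Appendix B, and the SAL parameter triples below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def cal_level_I(n=10_000):
    """Consensus level I ('adequate'): the CAL is the crisp value one."""
    return np.ones(n)

def cal_level_III(sal_params, n=10_000):
    """Consensus level III ('extremist'): a_k = min_i a_k^i, m_k = min_i m_k^i, b_k = max_i b_k^i."""
    a = min(p[0] for p in sal_params)
    m = min(p[1] for p in sal_params)
    b = max(p[2] for p in sal_params)
    return rng.triangular(a, m, b, size=n)

# Hypothetical SALs of three experts for one attribute, given as (a, m, b) triples.
sals = [(0.95, 1.00, 1.00), (0.95, 1.00, 1.10), (1.00, 1.05, 1.10)]
cal = cal_level_III(sals)      # samples from ~ Triang(0.95, 1.00, 1.10)
```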
The main objective of the SAL method is to identify the most critical discrepancies in the experts' judgements, and the most critical SAL or CAL measures with respect to the quality level. Once this is done, arguments related to the critical SAL or CAL measures are brought to the focal point of further discussion and elaboration of additional action to be taken during the next iteration of the software development process. In general, the obtained consensus levels are an indication of the lack or presence of shared knowledge about the achievement level of a quality attribute. It may happen that experts' wide, uninformative pdfs of the SALs related to an attribute coincide, but this is not interpreted as a high consensus level with respect to the unknown achievement level that is the subject of the measurement. Fig. 3 illustrates the roles of expert judgement and consensus rules in obtaining the pdf of a CAL related to quality attribute k, i.e. pdf(x_k).
3.3 Acceptance rules

Acceptance rules can be defined based on the pdfs of the CALs X_k, k = {1,…,K}, i.e. the pdf(x_k). The acceptance rules relate to decision problems such as 'Is the software ready for launch?' and 'Is the software ready to be implemented in the system?' The following rules are tentative and reflect specific quality strategies.
Definition. The risk averse acceptance rule. The risk averse acceptance rule states that 'The software is accepted if consensus level I has been reached for all quality attributes in the software quality model.' This rule reflects a quality strategy that is risk averse in the sense that the software is not accepted unless all the experts share the perception that the requirements with respect to all quality attributes are satisfied. This decision rule is therefore conjunctive (non-compensatory).
Definition. The risk tolerant acceptance rule. The risk tolerant acceptance rule states that 'The software is accepted if the conditions

    Pr{q_F ≥ 1} ≥ 1 - r_F   and   Pr{q_Fj ≥ 1} ≥ 1 - r_Fj  for all j        (3)

are satisfied' (multi-factor software quality model). Risk levels r_F and/or r_Fj (in practice r_F, r_Fj ∈ [0,1] and much smaller than 1) imply the tolerable risk of accepting an unqualified software version.
In other words, the software is accepted even if there exists some uncertainty related to its
compliance with given quality requirements (requirements related to the overall quality level or requirements related to individual quality factors). This decision rule is compensatory in the sense that a high achievement on one attribute can compensate for a low achievement on some other attribute, so that a quality factor or the overall software quality can still be at an acceptable level.
In practice, a company-specific software quality strategy may include a consistent combination of the above acceptance rules. Furthermore, it may be justified to set a maximum allowable value of the parameter z, e.g. 1.2, that is, the maximum relative achievement level of any quality attribute that an expert can give in his/her judgement (see section 3.1).
It is important to note that the more CALs of quality attributes we end up with that represent significant uncertainty, the less likely it is that the acceptance rules will be satisfied. Therefore, resources should be allocated during the software development process in a way that supports the achievement of high consensus levels. To recapitulate, consensus is sought with respect to the unknown achievement level, not the degree of uncertainty, i.e. the form of the pdf of a CAL.
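Once Monte Carlo samples of the quality level are available, the risk tolerant rule of Eq. (3) reduces to a simple probability check. A minimal sketch, assuming such samples have already been generated (the function and variable names are illustrative only):

```python
import numpy as np

def risk_tolerant_accept(qF_samples, r_F):
    """Eq. (3): accept if Pr{q_F >= 1} >= 1 - r_F, estimated from Monte Carlo samples."""
    return float(np.mean(np.asarray(qF_samples) >= 1.0)) >= 1.0 - r_F

# Hypothetical: if 9.3% of the simulated quality-level samples fall below one (cf. Fig. 5),
# the rule returns True for r_F = 0.10 and False for r_F = 0.05.
```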
Scenario analysis (Palisade 1996) reveals those SALs and CALs which significantly influence certain values of the overall quality level q_F (or of the quality factor level q_F in a single-factor quality model). In particular, we are interested in those SALs and CALs which influence the lowest values of the quality level. These CALs are coupled with uncertainties that should be resolved first, given that no consensus level IV exists. The most influential CALs are the primary targets for the elaboration of a higher consensus level during the next software development cycle. Obviously, if consensus level IV is attained during the expert elicitation, it is clear that a SAL measurement of the related quality attributes is premature. Thus, immediate efforts should be
allocated to the clarification of the definitions of and the requirements related to these attributes.
4. Software measurement processes

In section 3, the measurement of software quality, based on the metrics SAL and CAL, was described. Basically, stronger consensus is sought during the expert group meeting, but more importantly, different arguments are exchanged and new insights might emerge, changing the experts' perceptions of the quality of the software. Those experts or expert groups who are the most influential with respect to a software rejection outcome are identified, and can be asked to provide ideas on how to progress during the next software development cycle.
When should the SAL - method be applied during the software development process? What support material, such as used in the goal oriented metric approach GQM, should be used? Is it enough to use members of the software development team or should an independent V&V team be called? Should feedback from end users be requested? Such questions are relevant when defining the software measurement process. The answers will depend on the software company policy, liability issues, etc. It suffices to say that combining features of expert processes (NRC 1989) and software measurement processes (Baumert and McWhinney 1992, Järvinen 2000, Basili and Rombach 1988) may provide novel and innovative answers to the above questions. Issues related to the selection of experts, training in basic probability theory, timing of measurements, checklists, IT support in expert elicitation and documentation, etc. have to be addressed to make any software measurement process, relying on expert judgement, workable and credible.
5. The 'ISO9126' example

Assume that the quality strategy for a software product is defined based on the first five characteristics of the ISO 9126 software product evaluation standard, i.e. 'Functionality', 'Reliability', 'Usability', 'Efficiency' and 'Maintainability'. Assume further that experts are requested to give their SAL judgements of these characteristics only (a one-level quality factor hierarchy keeps the example simple). Assume also that the weights w_Fj of the characteristics are obtained by the use of the pairwise comparison technique delineated in Appendix A.
The CAL related to the Quality Level q_F is obtained from Eq. (2) as a function of the CALs q_Fj related to the five characteristics. It is important to keep in mind that the CALs are
modelled as random variables and the assessment of the pdf’s may require Monte Carlo simulation.
Three experts give their judgements on the achievement levels of the five software quality characteristics according to Table 2. In the case of 'Functionality' and 'Reliability' all experts consider the requirement to be satisfied, yielding the consensus level I for the respective CALs. For the other characteristics some experts perceive that the requirements are not met with certainty. In the case of 'Maintainability' the consensus level II is reached, whereas in the case of 'Usability' and 'Efficiency' consensus level III is reached. The resulting CAL for 'Maintainability', as obtained by arithmetic averaging (see Appendix B), is shown in Fig. 4. The reached consensus levels and the corresponding pdfs of the CALs are shown in Table 3.
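A rough sketch of how this example can be reproduced by Monte Carlo simulation is given below; the CAL pdfs follow Table 3 and the 'Maintainability' SAL parameters follow Table 2, while the remaining details (equal expert weights in the averaging rule, sample size, random seed) are assumptions of this illustration rather than part of the original computation.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
N = 20_000

def triang(a, m, b):
    return rng.triangular(a, m, b, size=N)

# CALs of the five characteristics (Table 3).
cal = {
    "Functionality": np.ones(N),                 # consensus level I
    "Reliability":   np.ones(N),                 # consensus level I
    "Usability":     triang(0.95, 1.00, 1.10),   # consensus level III
    "Efficiency":    triang(1.00, 1.05, 1.10),   # consensus level III
}
# 'Maintainability' (level II): equally weighted mixture of the experts' triangular SALs (Table 2).
maint_sals = [(0.85, 0.90, 0.95), (0.90, 1.00, 1.10), (1.10, 1.15, 1.20)]
pick = rng.integers(0, len(maint_sals), size=N)
cal["Maintainability"] = np.array([rng.triangular(*maint_sals[g]) for g in pick])

weights = {"Functionality": 0.05, "Reliability": 0.10, "Usability": 0.07,
           "Efficiency": 0.57, "Maintainability": 0.21}

qF = sum(weights[c] * cal[c] for c in weights)   # Eq. (2)
print(round(float(np.mean(qF < 1.0)), 3))        # Monte Carlo estimate of Pr{q_F < 1}; cf. the 9.3% in Fig. 5
```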
Assume that we have a risk level set at r_F = 0.10 for the risk tolerant acceptance rule. Now the rule gives a positive ('TRUE') evaluation outcome for the Quality Level, as shown in Fig. 5. On the other hand, if our risk level is r_F = 0.05, the rule would give a negative ('FALSE') outcome. Note that, due to the compensatory nature of the decision rule, a CAL with the possibility of a low quality level of 'Maintainability' (Fig. 4) does not prohibit a positive evaluation outcome in the former case.
Similarly, individual risk levels could have been set for any or all of the characteristics. Apply, for instance, the above risk levels for each quality factor in turn. For the quality factor ‘Maintainability’ the evaluation outcome would be negative for both risk levels.
Scenario analysis (Palisade 1996) may be adopted to find out those SAL and CAL judgements that are most influential for Quality Level values corresponding to specified percentiles of the simulated pdf or histogram of the Quality Level. More specifically, Scenario Analysis tells you which judgements are behind the lowest realisations of the Quality Level. Table 4 shows the most influential SAL and CAL judgements and their significance, when the percentiles are 5% and 10%. The significance is defined as the ratio of the difference between the conditional median and the whole simulation median of a characteristic, and its standard deviation. Let MD/SD denote this significance measure. The bigger the absolute value of the MD/SD, the more significant is the SAL or CAL of the characteristic to the Quality Level values within the respective percentiles. The conditional median is calculated from those samples that are related to the given Quality Level percentile. These values represent a ‘Quality Level scenario’ (see Table 4).
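A sketch of this significance measure is given below, assuming Monte Carlo sample arrays for the Quality Level and for an individual SAL or CAL are available; the exact conventions of the @RISK scenario analysis are not reproduced, so this is only an approximation of the idea.

```python
import numpy as np

def md_sd(char_samples, qF_samples, percentile):
    """MD/SD: (conditional median - overall median) of a characteristic, divided by its std.

    The conditional median is taken over the samples whose Quality Level falls within
    the lowest `percentile` percent of the simulated q_F values.
    """
    char = np.asarray(char_samples)
    low = np.asarray(qF_samples) <= np.percentile(qF_samples, percentile)
    md = np.median(char[low]) - np.median(char)
    return md / np.std(char)

# e.g. md_sd(cal["Efficiency"], qF, 5) would be compared against Table 4's QL5% entry.
```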
According to Table 4 we can deduce that the measures 'Efficiency_C3' and 'Maintainability_C2(Exp1)' are the most influential for Quality Levels q_F within the 5% percentile. For Quality Levels within the 10% percentile, the influences start to get mixed and only low levels for 'Efficiency_C3' are deemed significant. The results can be interpreted in terms of lack of information; lack of shared information is critical for 'Maintainability', and efforts aiming at a higher consensus level for the CAL of 'Maintainability' are motivated, especially as the quality requirement related to 'Efficiency' is achieved according to the experts' judgements. Efforts to increase a shared understanding may be specified by Expert 1, whose judgements are critical with respect to the evaluation outcome. Alternatively, could the 'Efficiency' requirement be exceeded and 'traded off' against an inadequate achievement level related to 'Usability', whose strategic weight is much lower than that of 'Efficiency'? The answer obviously depends on how strict the adherence is to the defined acceptance rule.
6. Discussion

The methodology of the use of expert judgement is basically subject to the same scientific principles as any other scientific methodology used to extract data from the physical world. Principles governing the use of expert judgement are (Cooke 1991):

• Reproducibility: it must be possible for scientific peers to review and reproduce the calculations based on experts' judgements.
• Accountability: the source of an expert's judgement must be identifiable.
• Empirical control: the performance of the experts, i.e. the goodness of their probability judgements, can be empirically evaluated.
• Neutrality: the method used to elicit expert judgement should encourage experts to state their true opinion.
• Fairness: all experts are treated equally a priori.
The challenges for any method using expert judgement in software quality assessment relate to the empirical control and neutrality principles. The other principles should not represent significant problems. In the case of empirical control, the performance of an expert can be assessed and quantified with respect to calibration (‘accuracy’) and entropy (‘precision’) (Cooke 1991). This requires a long track record of predictions made by the expert and the actual outcomes of the events considered.
In the case of neutrality, it is a challenge for the experts to maintain objectivity under pressure such as software development project deadlines. In this respect, the use of independent reviewers or experts from an independent V&V team may avoid the problems of motivational bias in the judgements. The assessor's work can be supported by a pre-defined list of questions according to the goal oriented metric approach (GQM). It has to be emphasised that whereas GQM uses different metrics, the SAL method uses only one generic metric.
7. Conclusions

The basic assumption of the proposed software measurement method is that software experts can give coherent judgements on achievement levels related to various quality factors and attributes. The Subjective Achievement Level (SAL) metric is defined, which is used to quantify the experts' judgements. Uncertainty is expressed by a Triangular probability density function. The Consensus Achievement Level (CAL) metric is defined based on construction rules reflecting the level of consensus in the group of experts. The overall quality level can be computed using a linear transformation or re-scaling together with weights depicting the relative importance of the factors and/or attributes. Acceptance rules are introduced to support decision-making regarding the launch or implementation of developed software. The software measurement method supports the definition of a software quality strategy formulated by

• a single- or multi-factor quality hierarchy
• a linear quality model (re-scaling and weights)
• weights depicting the relative importance of quality factors and quality attributes
• requirement levels associated with the quality factors
• acceptance rules combining the software measurement with the requirements to yield a 'go'/'no go' software launch decision
More importantly, the proposed measurement method supports a systematic identification of discrepancies in the experts’ judgements, and the identification of the most influential expert judgements determining the software evaluation outcome. Scenario Analysis suggests where
resources should be allocated during the next software development cycle to resolve the most significant lack of shared knowledge among the experts. The developed software measurement approach can thus be viewed as a support method for group decision-making.
Finally, the described approach to quantitative software qualification does not claim that the lack of hard evidence obtained during a software development process can be compensated for using expert judgements. Any evidence that influences the formation of a truthful mental model of the software in the mind of the expert should be sought and used.
Acknowledgements
The authors would like to express their gratitude to the three anonymous reviewers for their stimulating comments that helped to improve the paper.
Appendix A

A software development strategy with respect to a quality factor is partly manifested by the relative importance of the quality attributes related to it. The relative importance can be derived on a ratio scale by mutual comparisons of the quality attributes under the common quality factor. In the Analytic Hierarchy Process (AHP) (Saaty 1994) the set of all such comparisons can be represented in a square matrix, the elements of which denote the strength of preference of an attribute in the left column over an attribute in the top row. The judgement reflects the answer to two questions: which of the attributes is more important with respect to the quality factor, and how strongly, using the 1-9 scale in Table B. For a set of n quality attributes one needs n(n-1)/2 comparisons, because there are n 1's on the diagonal due to the comparison of attributes with themselves, and half of the remaining judgements are reciprocals. An example of a comparison statement would be 'confidentiality is very strongly more important than reliability', yielding the value 7 in the corresponding cell of the comparison matrix.
Table B. Preference scale of the AHP.

Strength of preference    Definition
1                         Equal importance
3                         Moderate importance
5                         Strong importance
7                         Very strong importance
9                         Extreme importance
2, 4, 6, 8                Intermediate levels of strength
Take the 'ISO9126' example of section 5 and assume that the following comparison matrix expresses the software development strategy:
[5 x 5 pairwise comparison matrix over the characteristics F, E, R, M and U, with entries on the 1-9 scale of Table B]
where F, E, R, M and U stand for 'Functionality', 'Efficiency', 'Reliability', 'Maintainability' and 'Usability', respectively.
The weights w_i of the attributes are now obtained by solving the eigenvalue problem related to the comparison matrix and setting the weight vector equal to the eigenvector corresponding to the largest eigenvalue. Generally, complete consistency of the judgements is not obtained, i.e. a_ij ≠ w_i / w_j for some i, j. The consistency index for the above matrix is CI = 0.08, which is considered sufficient by experience. The normalised weight vector of the above matrix, obtained with the software Expert Choice, is

    w = (w_F, w_E, w_R, w_M, w_U) = (0.05, 0.57, 0.10, 0.21, 0.07)
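As an illustration of the eigenvector computation (using numpy rather than Expert Choice), the sketch below derives the weights and Saaty's consistency index for a small, hypothetical comparison matrix; it is not the matrix of this appendix.

```python
import numpy as np

def ahp_weights(A):
    """Principal-eigenvector weights and consistency index CI = (lambda_max - n) / (n - 1)."""
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)                 # index of the largest eigenvalue
    w = np.abs(eigvecs[:, k].real)              # principal eigenvector (made positive)
    n = A.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)
    return w / w.sum(), ci

# Hypothetical 3 x 3 reciprocal comparison matrix on the 1-9 scale.
A = [[1,   5,   3],
     [1/5, 1,   1/2],
     [1/3, 2,   1]]
w, ci = ahp_weights(A)
print(np.round(w, 2), round(ci, 3))
```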
Appendix B

From an information theoretic standpoint we may define the aggregated distribution (or aggregated probability measure) P_C = P_C(P_1, …, P_G) according to some criterion C_G(P_1, …, P_G; P_C), which is a mapping ⊗_g P_g → R. The number of experts is denoted by G. In (Pulkkinen 1993), a mapping of the form

    C_G(P_1, …, P_G) = Σ_{g=1}^{G} I(P_g, P_C)

is studied, where I(P_g, P_C) = E_{P_g}[ln(dP_g / dP_C)] is the Kullback-Leibler divergence measure. Especially, it is shown that solving the minimisation problem min{C_G(P_1, …, P_G)}, i.e. minimising the decision-maker's total surprise when contrasting the aggregated probability measure P_C with each expert's probability measure P_g, amounts to taking the arithmetic average of the experts' probability measures P_g (or probability distributions, if dP_g = P_g(dx) = pdf_g(x) dx, which is usually the case).
In Monte Carlo simulation, samples corresponding to the arithmetic average distribution are obtained by first sampling an integer value g uniformly distributed in {1,…,G}, after which a sample is taken from the expert-specific distribution pdf_g(x).
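A minimal sketch of this two-step sampling scheme, assuming each expert's distribution is a triangular pdf (the parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def sample_arithmetic_average(expert_params, n=10_000):
    """Sample from the equally weighted arithmetic average of G experts' triangular pdfs."""
    g = rng.integers(0, len(expert_params), size=n)                    # step 1: pick an expert uniformly
    return np.array([rng.triangular(*expert_params[i]) for i in g])    # step 2: sample from pdf_g

# Hypothetical SALs of three experts, given as (a, m, b) parameters.
samples = sample_arithmetic_average([(0.85, 0.9, 0.95), (0.9, 1.0, 1.1), (1.1, 1.15, 1.2)])
```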
REFERENCES
Basili, V.R. and Rombach, H.D. 1988. The TAME project: Towards improvement-oriented software environments, IEEE Transactions on Software Engineering 14 (6): 758-773.
Baumert, J.H. and McWhinney, M.S. 1992. Software Measures and the Capability Maturity Model, Technical Report CMU/SEI-92-TR-25, Software Engineering Institute, Carnegie Mellon University, Pittsburgh.
Butler, R.W. and Finelli, G.B. 1993. The infeasibility of quantifying the reliability of life-critical real-time software, IEEE Transactions on Software Engineering 19: 3-12.
Cooke, R. 1991. Experts in Uncertainty, Oxford University Press, Oxford.
Fenton, N.E. and Neil, M. 1999. Software metrics: successes, failures and new directions, The Journal of Systems and Software 47: 149-157.
Fenton, N.E. and Pfleeger, S.L. 1997. Software Metrics - A Rigorous and Practical Approach, 2nd ed., PWS Publishing Company, Boston.
Fowler, M. and Scott, K. 1997. UML Distilled - Applying the Standard Object Modelling Language, Addison-Wesley Object Technology Series.
Garvin, D. 1984. What does "product quality" really mean?, Sloan Management Review, Fall 1984, 26 (1): 25-45.
ISO 9126. 1991. ISO 9126 Information Technology - Software Product Evaluation - Quality Characteristics and Guidelines for their Use, Geneva.
Järvinen, J. 2000. Measurement based continuous assessment of software engineering processes, VTT Publications 426 (Dissertation), VTT Technical Research Centre of Finland.
Keiller, P.A. and Miller, D.R. 1991. On the use and the performance of software reliability growth models, Reliability Engineering and System Safety 13: 95-117.
Kelly, G.A. 1955. The Psychology of Personal Constructs - A Theory of Personality, Norton, New York.
Kilpi, T. 2001. Implementing a software metrics program at Nokia, IEEE Software 18 (6): 72-77.
Littlewood, B. and Strigini, L. 1993. Validation of ultra-high dependability for software based systems, Communications of the ACM 36: 69-88.
Mason, D. and Woit, D. 1998. Software system reliability from component reliability. In Proc. of 1998 Workshop on Software Reliability Engineering (SRE'98), Ottawa, Ontario.
McCall, J.A. 1994. Quality Factors. In Encyclopedia of Software Engineering, Ed. Marciniak, J.J., John Wiley & Sons, New York.
Musa, J.D. 1993. Operational profiles in software-reliability engineering, IEEE Software 10 (2): 14-32.
NRC. 1989. NUREG-1150: Reactor Risk Reference Document, Report NUREG-1150, U.S. Nuclear Regulatory Commission, Office of Nuclear Regulatory Research, Maryland.
Palisade Corporation. 1996. @Risk - Advanced Risk Analysis for Spreadsheets, Palisade Corporation, New York.
Pulkkinen, U. 1993. Methods for combination of expert judgements, Reliability Engineering and System Safety 40: 111-118.
Rumbaugh, J., Booch, G. and Jacobson, I. 1998. Unified Modelling Language, Addison-Wesley Object Technology Series.
Saaty, T.L. 1994. How to make decisions: The Analytic Hierarchy Process, Interfaces 24 (6): 19-43.
Tervonen, I. 1996. Support for quality-based design and inspection, IEEE Software 13 (1): 44-54.
Figure 1. The Quality Factor Hierarchy, where a quality factor is conceptually decomposed into quality attributes and measures.
Figure 2. Change of the probability density function of the Subjective Achievement Level of expert i related to quality attribute k, as the number of software development cycles increases.
Figure 3. Expert judgement and construction rules for defining the probability density function of the Consensus Achievement Level of quality attribute k.
Figure 4. Histogram of the Consensus Achievement Level of quality factor ‘Maintainability’ obtained by arithmetic averaging of the probability density functions of experts’ Subjective Achievement Levels related to ‘Maintainability’.
Figure 5. Histogram of the Consensus Achievement Level of Quality Level qF and the evaluation outcomes of the risk tolerant acceptance rule for risk levels 0.10 and 0.05. The line qF = 1 corresponds to the 9.3% percentile.
Table 1. Consensus levels and construction rules for the probability density function of the Consensus Achievement Level of a quality attribute.

I 'adequate': Experts agree that the achievement level is obtained. This is interpreted as the CAL obtaining the value of one.
II 'average' (strong consensus on the unknown achievement level): Experts agree that the pdf of the CAL is obtained by weighted averaging of the experts' pdfs of their SALs.
III 'extremist' (weak consensus on the unknown achievement level): Experts agree that the pdf of the CAL is obtained by taking the extremist boundary values found in the pdfs of the experts' SALs.
IV 'veto': The pdf of the CAL cannot be agreed upon, or is unanimously not worth eliciting.
Table 2. Elicited parameter values of the Triangular probability density functions of the Subjective Achievement Levels of the different quality characteristics in the 'ISO 9126' example.

Quality Characteristic Fj   Weight w_Fj   Expert 1 pdf(q_Fj^1)        Expert 2 pdf(q_Fj^2)       Expert 3 pdf(q_Fj^3)
Functionality               0.05          1                           1                          1
Reliability                 0.10          1                           1                          1
Usability                   0.07          Triang(0.95, 1.0, 1.0)      1                          Triang(0.95, 1, 1.1)
Efficiency                  0.57          1                           Triang(1, 1.05, 1.1)       Triang(1.0, 1.05, 1.1)
Maintainability             0.21          Triang(0.85, 0.9, 0.95)     Triang(0.9, 1.0, 1.1)      Triang(1.1, 1.15, 1.2)
Table 3. Consensus levels and probability density functions of the Consensus Achievement Levels of the different quality characteristics in the 'ISO9126' example.

Quality Characteristic   CAL pdf(q_Fj)                             Consensus Level
Functionality            1                                         I
Reliability              1                                         I
Usability                Triang(0.95, 1, 1.1)                      III
Efficiency               Triang(1, 1.05, 1.1)                      III
Maintainability          By Monte Carlo simulation, see Fig. 4     II
Table 4. Significance measures MD/SD related to the SAL or CAL of the characteristics in the 'ISO9126' example, denoting their influence on low Quality Levels within the 5% and 10% percentiles (QL5% and QL10%).

Quality Level scenario                 QL5%     QL10%
MD/SD ('Efficiency_C3')                -1.48    -1.17
MD/SD ('Maintainability_C2(Exp1)')     -0.51    non-significant
The negative sign stems from the median difference MD, which is calculated as the difference of the median of the subset of SAL or CAL samples corresponding to a specified ‘left tail’ percentile of the pdf of Quality Level, and the median corresponding to the whole simulation, which is larger.