10th International Software Metrics Symposium - Chicago 14-16 September 2004
Assessment of Software Measurement: an Information Quality Study
Michael Berry (1,2), Ross Jeffery (1,2), Aybuke Aurum (1)
(1) National ICT Australia
(2) University of New South Wales, Australia
Abstract
This paper reports on an empirical study of methods to assess the quality of the information in software measurement products, where the goal is to improve the information support provided to managers and software engineers. In Phase One of the study, two measurement assessment instruments are developed and deployed in order to generate two sets of analyses and conclusions. These sets will be subjected to an evaluation of their information quality in Phase Two of the study. One assessment instrument was based on AIMQ, a generic model of information quality. The other instrument was developed by targeting specific practices relating to software project management and identifying their requirements for information support. Both assessment instruments delivered data that could be used to identify opportunities to improve measurement. The generic instrument is cheap to acquire and deploy, while the targeted instrument requires more effort to build. Conclusions about the relative merits of the methods, in terms of their suitability for improvement purposes, await the results from the second phase of the study.
1. Introduction
The goal of the work discussed in this paper is to evaluate methods by which organisations may assess the quality of their software measurement products and identify improvement opportunities. Two methods are examined: the first is based on a generic model of information quality developed by Lee, Strong, Kahn and Wang [19]. The second is based on a method demonstrated by Berry and Vandenbroek [5] in which a performance model is developed by targeting specific practices that software managers and engineers carry out. The question is: which method delivers better value for the purposes of improving software measurement - the cheap, quick method based on a generic model of information quality, or the more expensive, longer method that is tailored for specific people, contexts and processes?

The motivation for the work was the apparent reluctance of the industry to take up the targeted method. Australian and international organisations were invited to participate in trials of the targeted method with little response. The investment of 30 to 200 hours of resource effort into generating a targeted assessment was apparently not worth the value obtained. The difficulty of selling assessment of measurement (meta-measurement) is not surprising: the link between an investment in meta-measurement and the business return is attenuated and complex. We do not have the metrics and evidence to demonstrate the return from meta-measurement in a manner similar to the ROI methods being applied to investments in software process improvement [29]. Meta-measurement is a leap of faith that may be easier to make when only a small investment is required.

The absence of meta-measurement presents a challenge for the software engineering industry. Software measurement is required in order to support managerial and engineering processes. Many authors have clearly demonstrated the importance of this support, among them [1, 3, 9, 13, 26, 28]. But without objective assessment, how is it possible to determine whether software measurement is efficient and effective, and aligned with the needs of the organisation's people and software engineering processes? Without objective assessment, it is difficult to identify the best opportunities to improve measurement.

There is increasing pressure on organisations to improve measurement through evaluation. An international standard [16] and a normative process model [7] for software measurement and analysis have been published with explicit requirements to evaluate measurement for the purposes of improvement. For example, the SEI CMMI model for "Measurement and Analysis" calls for organisations at a capability maturity level of 2 (Managed) to:
GP 2.8 Monitor and Control the Process
GP 2.9 Objectively Evaluate Adherence
GP 2.10 Review Status with Higher Level Management
Organisations at a capability maturity level of 3 (Defined), and above, are expected to:
GP 3.1 Establish a Defined Process
GP 3.2 Collect Improvement Information.
Some organisations are being directed to adopt these standards and models; for example [15]: "ISO/IEC
15939:2002, Software Engineering – Software Measurement Process, was adopted on 19-MAR-03 for use by the Department of Defense (DoD)".

The requirement for meta-measurement within the CMMI Measurement and Analysis model is highlighted by Goldenson, Jarzombek and Rout [12], who write: "Like any other process area, measurement and analysis can progress from being performed in an essentially ad hoc manner, through following a well-defined measurement process, to using measurement to evaluate and improve the measurement process itself." Similarly, McGarry et al. in their text on Practical Software Measurement [21] devote a chapter to evaluating measurement, covering both the measures and the process. Within their focus on the operational layer of management, they state that "The objective of a measurement program is to generate information that provides insight into project information needs so that the project manager can make informed decisions". They recommend that "The measurement program should be evaluated regularly and actions taken to continually improve it". One criterion that they identify for evaluating measurement product is "Measurement Fitness for Purpose", which they define as the extent to which measurement results effectively satisfy the identified information need. Indicators of fitness for purpose might include accuracy, usability, reliability, timeliness, confidence, comprehensiveness, appropriateness (relevance), understandability and interpretability.

The next section of this paper reviews current approaches to the assessment of software measurement. In subsequent sections, an empirical study of two methods to assess measurement products is described. In Phase One of the empirical study, students evaluated the quality of the information that they were provided with in order to carry out a software project management assignment. The students' feedback was captured using two different evaluation methods. Their feedback was analysed and conclusions drawn about how best to improve the quality of the information for future students.
2. Assessment of measurement
The goal of assessment is to characterise an object of interest with respect to a chosen model so that it may be understood at the chosen level of abstraction [10]. The chosen model expresses how the properties of the object of interest interact to provide a particular level of performance. Characterisation occurs when values are assigned to the properties of the object of interest through an act of measurement that maps empirical observations of each significant property into the chosen model. The characterisation of actual performance can then be compared to a model of best practice, with the differences being the opportunities for improvement. In this paper we use the term meta-measurement to mean measurement of measurement for the purpose of assessment in order to improve. The object of interest for this paper is the measurement product, because methods for the evaluation of measurement products are lacking.

Research into methodologies for meta-measurement has enabled measurement to be better understood and improved at the organisational level [4, 11, 14, 17, 24], but there are few publications dealing with the assessment and improvement of measurement at the process level. Daskalantonakis, Yacobellis and Basili [8] describe a method for assessing software measurement in a manner consistent with the SEI software process assessment methodology [23]. Fundamental to their method is the concept of measurement themes that improve over time according to a consistent, orderly, evolutionary pattern. Budlong and Peterson have augmented and generalised the Daskalantonakis approach and formalised it in the Software Metrics Capability Evaluation Guide [6]. Mendonca and his colleagues [22] present a methodology for "understanding the data and the metrics and how they are fulfilling the needs of data users in an MF (Measurement Framework)". Their approach addresses the twin issues of data being collected for no good purpose and of data being unused because its existence is unknown. The method proposed by Mendonca is rigorous and effective, with the organisation making changes as a result of the assessment. However, it seems resource intensive. Many organisations may find it more politically and financially feasible to adopt the incremental Plan-Do-Check-Act improvement approach incorporated in ISO/IEC standard 15939 – Software Measurement Process [16] and the SEI's CMMI "Measurement and Analysis" process area [7].

The approaches outlined above may be designated as either factor-based or process-based and may provide only a limited view. Ultimately, measurement is a matter of the relationship between human cognitive processes and measurement products, and these approaches pay insufficient attention to this relationship. Rivet et al. have discussed a cognitive-based approach to designing measurement frameworks [25] and Hall and colleagues [2] have evaluated client satisfaction with software measurement as part of a more general study into sentiments towards software process improvement. Unfortunately, there seem to be no validated human-centred instruments for use by
practitioners to evaluate and improve software measurement.

Software measurement may be characterised as an information system for collecting, analysing and communicating measures of software processes, products and services. The information system is specified, constructed, implemented, operated and maintained through a set of practices variously referred to as the Experience Factory [3], Metrics Program or Measurement Framework [22]. The system creates business value by delivering measurement products to human beings, who interpret them as information.

Evaluating measurement products requires qualitative methods. This is common to all information systems, and it may be helpful to look outside the software engineering domain for a solution. Like previous approaches to the evaluation of software measurement, evaluation of information systems has also been predominantly factor-based and may have been too simplistic for the complexity of the phenomenon. Sauer [27] suggests that a qualitative approach is needed to deal with the level of complexity in socio-technical systems; this may well be applicable to the evaluation of software measurement.

A qualitative approach to the evaluation of information systems is offered by the work of a group of MIS researchers centred on the Information Quality program at the Massachusetts Institute of Technology [18, 19, 20]. Lee, Strong, Kahn and Wang [19] developed a model of Information Quality (IQ) that contains 15 orthogonal concepts (termed IQ Dimensions) - see Table 1. The model formed the basis for the AIMQ instrument, which can be used to identify problems with an organisation's information systems and/or to assess the systems against benchmarks. See Table 2 for examples of the instrument test items. The AIMQ instrument was tested in an industrial setting and was shown to have acceptable reliability as measured by Cronbach's alpha.

IQ Dimension                 AIMQ Instrument
Understandability            4 items, Alpha = .90
Completeness                 6 items, Alpha = .87
Appropriate Amount           4 items, Alpha = .76
Ease of Operation            5 items, Alpha = .85
Interpretability             5 items, Alpha = .77
Relevancy                    4 items, Alpha = .94
Accessibility                4 items, Alpha = .92
Believability                4 items, Alpha = .89
Concise Representation       4 items, Alpha = .88
Consistent Representation    4 items, Alpha = .83
Free of Error                4 items, Alpha = .91
Objectivity                  4 items, Alpha = .72
Reputation                   4 items, Alpha = .85
Security                     4 items, Alpha = .81
Timeliness                   5 items, Alpha = .88
Table 1 – AIMQ Model of Information Quality

IQ Dimension and Test Items: Understandability
1. The meaning of the information was easy to understand.
2. The information was easy to understand.
3. The meaning of the information was difficult to understand.
4. The information was easy to comprehend.
Table 2 – AIMQ Instrument Example

The AIMQ instrument has been validated, and its domain of use is appropriate to software measurement if software measurement is viewed as a type of information system. The instrument is simple to deploy and analysis of the collected data is straightforward. The instrument appears to offer a cheap and easy method to evaluate and benchmark software measurement in an organisation. However, it appears that analyses prepared from AIMQ data may provide insufficient guidance to the people responsible for improving software measurement. A recent paper [20] by two of the developers of AIMQ states that: "This approach focuses on the process of producing data and information, rather than only on the quality of the information product produced."

In contrast, the method for targeted assessment of software measurement [5] has been specifically developed to support the improvement of the software measurement process. It focuses on the measurement support for specific software engineering processes and the practices that people must carry out. It is quite clear where any improvement can be expected to become apparent. This provides a control mechanism to evaluate the value of that improvement. A targeted assessment enables people to state their priorities for improvement, so clear direction is provided to the "improvers" as to what needs to be improved first and why.

The 2001 trial of targeted assessment of software measurement identified both project tracking activities
and measurement practices that should be improved. The trial demonstrated the feasibility and effectiveness of a targeted approach to the assessment of measurement. But this assessment was obtained at a cost of two hundred hours of resource time, and only one key practice area in the CMM model was addressed. A repeat deployment of the instrument to determine whether the improvements had been effective would cost at least thirty hours of resource time, assuming around thirty respondents. Similarly, use of an off-the-shelf targeted assessment instrument by an organisation would require at least thirty hours of resource time.

Using a generic instrument like AIMQ is the minimum investment that an organisation can make in the assessment of software measurement. A targeted assessment requires more investment, and the methods described by Budlong and Peterson [6] and Mendonca and his colleagues [22] would require the greatest investment. In the following section, we describe Phase One of a study to compare assessment of measurement products using the AIMQ and Targeted methods.
3. The Information Quality Study
The AIMQ and Targeted methods were chosen for study because they present the lowest barriers to entry for those interested in performing meta-measurement for the purpose of improvement. Assessment of measurement products is a process of consulting the clients for measurement. It focuses on their satisfaction with the quality of the information they are given. Although knowing the clients' perceptions is only a precursor to changes that will deliver information quality that meets clients' expectations, it is essential. It demonstrates that the focus of the measurement framework is on the users of measurement products. It builds an alliance between the producer and the consumer that makes the measurement framework less exposed to the financial volatility and changes in business strategy that can lead to cancellation of measurement initiatives.

The Information Quality Study is one of a set of studies into Assessment of Software Measurement. The study simulates an industrial situation in which the quality of a set of measurement products needs to be assessed in terms of their ability to support project management. In the study, "practicing software project managers" are simulated by students, while their lecturers simulate the people responsible for the software measurement framework. The information provided for the students' software project management assignment simulates the information product provided by the software measurement framework. The analyses of the students' responses simulate the objective assessment that should be the basis of an improvement plan.

Research Questions
The research questions to be answered by this simulation are:
1. Does assessment of software measurement using an instrument based on the AIMQ model of Information Quality yield information of sufficient quality to support improvement?
2. Does assessment of software measurement using an instrument based on the Targeted method yield information of sufficient quality to support improvement?
3. Which of the methods is considered superior for the purposes of improvement?

The study was preceded by a pilot study and has two phases. In Phase One, measurement assessment instruments are developed and deployed, and the students' feedback is collected and analysed. In Phase Two, these assessments of measurement will be objectively evaluated. Phase Two commenced in February 2004 and will be completed by April 2004. This paper is concerned with only the pilot study and the Phase One study.
3.1. The Pilot Study
The purpose of the pilot study was to test the method for developing the assessment instrument, to test the web technology, and to estimate response times. An initial web-based survey instrument was developed and deployed in November 2002. This instrument evaluated a subset of six IQ dimensions from the AIMQ and also included probes relating to software project management based on the M2P work discussed in [5]. Fifty test items were used, organised into ten probes each containing five test items. The probes were presented in a random sequence to each respondent. After the ten probes, the respondent was presented with a final probe that collected data on the respondent's attitude to the survey. Students undertaking the Masters-level subject in Project Management were invited by the lecturer in charge to complete the survey. Thirteen students out of a total enrolment of 140 responded. Ten students completed all questions.

Instrument Reliability
The reliability of the resultant instrument (Table 3) was encouraging, as the alpha values for four of the constructs were above the accepted level of 0.70. However, in order to improve the reliability of the Phase One instrument, test items containing negatives such as "The amount of information is not sufficient for
our needs." were rephrased to be positive. The present tense was changed to past tense and it was made clearer that the students' project management assignment was the object of interest.

Test Constructs        AIMQ alpha    Pilot instrument alpha
Appropriate Amount     0.76          0.72
Completeness           0.87          0.71
Ease of Operation      0.85          0.77
Interpretability       0.77          0.56
Relevancy              0.94          0.82
Understandability      0.90          0.50
Table 3 – Pilot Instrument Reliability
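The reliability figures reported in Tables 1, 3 and 5 are Cronbach's alpha coefficients. As a point of reference only (the following sketch is not part of the study and uses invented ratings), alpha can be computed from a respondents-by-items matrix of Likert responses:

import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) matrix of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Invented example: five respondents rating the four Understandability items.
ratings = [[2, 3, 2, 2],
           [5, 4, 5, 6],
           [7, 7, 6, 7],
           [3, 2, 3, 3],
           [9, 8, 9, 10]]
print(round(cronbach_alpha(ratings), 2))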
Instrument Usability
The respondents were asked for their reactions to five assertions about the survey. One comment was received: "While your questions are clear some of them do not seem to be referring to the subject that I have just done. I think this is a problem with the subject rather than your survey though. I hope this feedback helps. You could also make the colours a bit better, the screen is really hard to look at." As a result of this comment, further work was put into improving the presentation of the probes. Overall, the respondents were positive about the survey, with mean ratings per item of between 4.1 and 6.9 on a 13-point scale, where the lower the value, the more positive the sentiment.

Responses
Thirteen respondents contributed data that could be regarded as meaningful. Mean ratings for information quality for each respondent on the 13-point scale ranged from 3.0 (positive) to 9.6 (negative), and the standard deviations in the set of responses for each respondent ranged from 0 to 4.7. The conclusion from the pilot is that respondents will be sufficiently discriminating in their responses to support an analysis of the data. However, inter-rater consistency seems low, considering that all the respondents were assessing the same information set.

Time to Complete the Assessment
Using the time-stamps on responses, the mean time for respondents to complete the eleven probes was estimated to be 8.4 minutes with a standard deviation of 3 minutes. The mean time to complete a probe was 48 seconds with a standard deviation of 30 seconds. The data from the pilot enabled us to estimate with confidence that a respondent would require up to 10 minutes for the Phase One assessment instrument.

Time to Complete an Individual Probe
This analysis is used to identify probes that may be causing particular difficulty for respondents. The mean time to complete a single probe ranged from 0.6 minutes to 1.5 minutes, with standard deviations ranging from 0.6 minutes to 1.4 minutes. No probe stood out as particularly difficult for the respondents.

Reverse coding was present in the original AIMQ instrument. This format is often used to enable internal consistency checks and to inhibit respondents from "marking down the page". It was retained for the Phase One study, for example, "The information was difficult to manipulate to meet our needs." There were reservations about the value of reverse coding where respondents may be required to disagree with a negative assertion. This may particularly apply to respondents from non-English-speaking backgrounds.

Impact of the Order of Completion
Two effects may be observed over a survey: one is a learning effect and the other a fatigue effect. The survey incorporated a checkpoint/restart feature and respondents were asked to stop if they felt fatigued and pick up the survey later. Only one respondent used this feature, so fatigue is unlikely to be an issue. The order of completion of probes did not appear to be a significant issue. However, it was decided to retain the practice of presenting the probes in a randomised sequence to the respondents in order to minimise the impact of the learning and fatigue effects. In the following section, the conduct of Phase One of the Information Quality Study will be discussed in terms of the instruments and the subjects.

4. Information Quality Study: Phase One
The previous section discussed the Pilot Study that enabled us to test and improve the measurement assessment instrument. This section deals with Phase One of the Information Quality Study. The first part of the section discusses the preparation of the instruments used in the survey. The second part characterises the respondents. Part three discusses the impact of the respondents' demographics and part four considers the possible impact of the respondents' cognitive style on their responses.
4.1. The Instruments
This part discusses the instruments developed and deployed in Phase One of the study. These instruments evaluated:
• Respondent demographic
• Generic Information Quality
• Targeted Information Quality
• Survey Feedback
The instruments were deployed as a set of web forms using the internet. While it was anticipated that a respondent would complete the assessment during one session, a checkpoint-restart facility was retained to enable respondents to stop and return later. The complete set of web forms consisted of:
1. A single Respondent Demographic form containing 4 items.
2. Five Generic Information Quality forms, each containing 5 items and a comment box.
3. Five Targeted Information Quality forms, each containing 5 items and a comment box.
4. A single Feedback form, containing 5 items and a comment box.
The demographic form was presented first, and the feedback form last. The other forms were interspersed and presented in a randomised order to the respondents. The Respondent Demographic instrument used a pull-down menu with prescribed categories. The other instruments used a 13-point Likert scale in which the respondent was presented with an assertion and invited to register the level of their agreement (or disagreement).
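Purely as an illustration of the presentation order just described (demographic form first, feedback form last, the ten IQ forms shuffled in between), and not code from the study, a sketch might look like this; the form names are invented placeholders:

import random

def build_form_sequence(generic_forms, targeted_forms, seed=None):
    """Order of web forms for one respondent: demographic first,
    feedback last, the ten IQ forms presented in a random order."""
    middle = list(generic_forms) + list(targeted_forms)
    random.Random(seed).shuffle(middle)
    return ["respondent_demographic"] + middle + ["survey_feedback"]

generic = ["generic_iq_%d" % i for i in range(1, 6)]
targeted = ["targeted_iq_%d" % i for i in range(1, 6)]
print(build_form_sequence(generic, targeted, seed=1))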
4.1.1. Respondent Demographic
This instrument was used to collect data in order to characterise the subjects in terms of attributes that may condition their responses. The respondents were a mix of full-time and part-time students. It was believed that work experience in general, and managerial experience in particular, could affect the respondents' responses to the information that they were given. A measure was also needed of how familiar the respondent was with project management.

4.1.2. Generic Information Quality
There are 25 items in the Generic Information Quality (IQ) instrument. Only six of the fifteen information quality dimensions in the AIMQ instrument were relevant to this study. The other dimensions were not appropriate to the study context, in which a single authoritative source provided a single information set to all recipients. The items that were used were taken from the AIMQ instrument with minor rephrasing. Definitions of the six relevant IQ Dimensions are provided in Table 4, below.

IQ Dimension: Definition
Understandability: the extent to which information is easily comprehended.
Completeness: the extent to which information is not missing and is of sufficient breadth and depth for the task at hand.
Appropriate Amount: the extent to which the volume of information is appropriate to the task at hand.
Ease of Operation: the extent to which information is easy to manipulate and apply to different tasks.
Interpretability: the extent to which information is in appropriate languages, symbols and units and the definitions are clear.
Relevancy: the extent to which information is applicable and helpful for the task at hand.
Table 4 – IQ Dimension Definitions

The response scale was constructed so that a rating of "1" indicated "very strong agreement" with a positive or negative statement such as "It is easy to interpret what this information means" or "The information was difficult to aggregate". Responses to negative statements were re-coded so that, for all items, a rating of "13" indicated "very strong disagreement" with a positive statement. The effect of this was that low ratings indicated positive sentiments (satisfaction) towards the particular information quality dimension, while high ratings indicated dissatisfaction. The scale is ordinal; however, for some of the calculations the responses are treated as if an interval scale was in use.
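The re-coding described above amounts to reflecting the 13-point scale for negatively worded items so that, for every item, low values mean satisfaction. The following sketch is illustrative only; the paper does not state the exact transformation used, and the obvious reflection (14 minus the raw rating) is assumed here:

def recode(raw, negative_item):
    """Re-code a 1-13 Likert response so that, for every item,
    1 is the most positive sentiment and 13 the most negative."""
    if not 1 <= raw <= 13:
        raise ValueError("response must be on the 1-13 scale")
    return 14 - raw if negative_item else raw

# "The information was difficult to manipulate to meet our needs." is a
# negative assertion, so strong agreement (1) is re-coded as dissatisfaction (13).
print(recode(1, negative_item=True))   # 13
print(recode(1, negative_item=False))  # 1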
Analysis of the data collected in Phase One (Table 5) suggests that, apart from the "Interpretability" dimension, a reasonably reliable survey instrument (alpha >= 0.70) had been produced. In particular, changes since the Pilot had improved the reliability of the test items for the "Understandability" dimension.

IQ Dimension          N     N of Items    Reliability (Alpha)
Understandability     34    4             0.83
Completeness          33    3             0.82
Appropriate Amount    32    4             0.82
Ease of Operation     31    5             0.72
Interpretability      31    5             0.47
Relevancy             30    4             0.80
Table 5 – Phase One IQ Instrument Reliability

4.1.3. Targeted Information Quality
There are five probes in the Targeted IQ instrument. Each probe was presented on a separate web form. Each probe targets a specific practice that the students would need to carry out in order to complete their assignment. The targeted practices were identified by reference to the Specific Practices in the CMMI model for the "Project Planning" process area [7]. The practices targeted for evaluation are those for which a significant amount of information is required in order to complete the practice. These were:
1. Establish Estimates of Work Product and Task Attributes (CMMI Project Management SP 1.2-1).
2. Determine Estimates of Effort and Cost (CMMI Project Management SP 1.4-1).
3. Establish the Budget and Schedule (CMMI Project Management SP 2.1-1). Two tasks were targeted:
• Develop a schedule
• Determine task dependencies
4. Plan for Project Resources (CMMI Project Management SP 2.4-1).

The probes follow the format for targeted evaluation described by this researcher in [5]. Each probe contains four items related to the targeted practice (which may be a process, activity or task):
1. An item evaluating the respondent's perception of how well the practice is currently performed.
2. An item measuring the importance attached by the respondent to performing the practice well.
3. An item evaluating the quality of the information provided in order to carry out the practice.
4. An item measuring the importance attached by the respondent to having high-quality information in order to carry out the practice.

Each test item was stated in the form of a positive assertion. In the method described in [5], the assertions were generated from statements made by members of the target group of practitioners during focus groups. In this study, the assertions were generated by reviewing the student assignment and mapping it to a set of CMMI Specific Practices for Project Planning. Specific practices that were highly reliant on measures for their completion were targeted for the instrument. The responses to the four test items enable the identification of those practices:
• that are being poorly performed but should be performed well, and
• where the people carrying out the practice are receiving poor information and believe that better information should be provided.
Such practices present the best opportunities for a measurement improvement plan, in terms of both room to improve and need to improve.

In addition, each probe contained an information quality question modelled on the AIMQ instrument that refers to the practice, and an item for entering a free-text comment. A typical probe is shown in Figure 1, below. It contains the following:
• A statement to establish the question context.
• Two test items to evaluate the performance and importance of the target practice.
• Two test items to evaluate the performance and importance of measurement support for the target practice.
A test item based on an AIMQ dimension of Information Quality was included in each probe. The intention was to look for consistency between IQ dimensions at the general level and at the practice level. For the probe shown in Figure 1, the assertion was: "The information that you were given in order to estimate the size of work products was complete."

Context statement: "The information referred to below is the information that you were provided with in order to Establish Estimates of Work Product and Task Attributes (CMMI Project Management SP 1.2-1)."
Practice View – Performance: "Sizing of work products was accurately performed."
Practice View – Importance: "It is important that work products are accurately sized."
Measurement View – Performance: "The methods for estimating the size of work products were effective."
Measurement View – Importance: "It is important that there are effective methods for estimating the size of work products."
Figure 1 – An Example of a Targeted IQ Probe
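A targeted probe as described above can be thought of as a small record of five ratings plus a comment. The sketch below is purely illustrative (the study collected these responses through web forms; the field names and thresholds here are invented). It also shows how the selection rule above - practices performed poorly but rated important, with poor but needed measurement support - could be expressed, remembering that on the re-coded scale low ratings are positive:

from dataclasses import dataclass
from typing import Optional

Rating = Optional[int]   # 1-13 Likert response; None represents "Don't know"

@dataclass
class TargetedProbe:
    """One targeted IQ probe: a practice, four ratings and an AIMQ-style item."""
    practice: str                    # e.g. a CMMI Project Planning specific practice
    practice_performance: Rating     # how well the practice was performed
    practice_importance: Rating      # importance of performing the practice well
    measurement_performance: Rating  # quality of the information provided for the practice
    measurement_importance: Rating   # importance of having high-quality information
    iq_item: Rating                  # AIMQ-based item, e.g. completeness of the information
    comment: str = ""

def improvement_candidate(p, poor=9, important=4):
    """Illustrative rule: performance rated poorly (high values) while
    importance is rated highly (low values), for both the practice and
    its measurement support. Thresholds are arbitrary examples."""
    ratings = (p.practice_performance, p.practice_importance,
               p.measurement_performance, p.measurement_importance)
    if any(r is None for r in ratings):
        return False
    return (p.practice_performance >= poor and p.practice_importance <= important
            and p.measurement_performance >= poor and p.measurement_importance <= important)

probe = TargetedProbe("Develop a schedule", 11, 2, 10, 3, 8)
print(improvement_candidate(probe))   # True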
An individual's sentiments regarding performance and importance have unique, continuous distributions. It was decided that a thirteen-point, agree/disagree Likert scale, anchored at each end, would provide a reasonable approximation to an interval scale. A distinct "Don't know" response was provided as the default value for all test items. In the trial reported in [5], a seven-point Likert scale had proved too coarse for some of the statistical treatments that were tried. The developers of the AIMQ instrument [19] had used an 11-point scale, commenting that: "The use of an 11-point scale is based on previous experience with IQ assessment." The two scales for performance and importance were visually aligned in order to suggest to the respondents that they should treat performance and importance as being measured on similar interval scales.
4.1.4. Survey Feedback
It is important that the process of assessing software measurement is itself subjected to evaluation so that improvements can be made until instruments of industrial quality are produced. A timestamp is placed on each response when the enter button is pressed. By tracking all the responses for a respondent, it is possible to derive an indicator of the time taken to make each set of responses. This is not a true measure of how long the respondent spent reading the assertions and making their responses; there may have been network delays or the respondent may have been involved in other activities. By excluding outlier values and then calculating the mean time and variance for the remaining set of responses across the respondent sample, it is possible to estimate the maximum time to complete the survey with some confidence. It also enables identification of questions with which respondents experienced difficulty.

In addition to these automatically collected indicators, respondents were asked for their feedback. Subjects responded to five items relating to the value and relevance of the assessment, the usability of the instrument and the comprehensiveness of the items in the instrument. Their responses were measured on the same 13-point scale used in the rest of the assessment. The respondents were also invited to enter a free-text comment.
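As a sketch of the kind of timing analysis described above (illustrative only, with invented timestamps and an assumed outlier cut-off; the study's actual analysis scripts are not reproduced here):

import statistics
from datetime import datetime

def completion_minutes(timestamps):
    """Elapsed minutes between a respondent's first and last response."""
    ts = sorted(datetime.fromisoformat(t) for t in timestamps)
    return (ts[-1] - ts[0]).total_seconds() / 60.0

def summarise(times, cutoff_minutes=30.0):
    """Exclude implausibly long sessions (e.g. the survey left open overnight),
    then report the mean and standard deviation of completion time."""
    kept = [t for t in times if t <= cutoff_minutes]
    return statistics.mean(kept), statistics.stdev(kept)

# Invented example: first and last response timestamps for three respondents.
sessions = [["2003-06-10T10:00:00", "2003-06-10T10:08:30"],
            ["2003-06-10T11:00:00", "2003-06-10T11:06:10"],
            ["2003-06-11T09:00:00", "2003-06-11T09:11:45"]]
print(summarise([completion_minutes(s) for s in sessions]))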
4.2. The Subjects
In June 2003, 230 students who had taken a third-year undergraduate subject in Software Project Management were asked to provide feedback. As part of their course, the students had completed a practical exercise in groups of up to five. They were required by their lecturer to go to this researcher's web site in order to complete a Peer Review exercise so that the marks for the practical exercise could be allocated among the group members. At the end of that exercise, the students were asked to complete the Information Quality survey. The practical exercise had required the students to decompose a functional specification and then estimate and schedule the software development tasks involved. While students may have been given varying assistance from their tutors on how to carry out the tasks required for the assignment, all students received the same information to be used as input to those tasks. It is the quality of that input information that was evaluated in Phase One of this study.

Eighty students began the survey but only thirty-five completed all questions. Partially completed response sets were usable because of the randomised order of presentation of questions; this lessened the impact of respondents stopping before the end of the survey, in that all questions had an equal chance of being answered. Eighteen respondents made ten or fewer responses. Three had made only one response and were dropped from the data for analysis. One respondent had made five responses, all of the same value, and was also dropped. Examination of the remaining fourteen showed no systemic reason why they should be dropped. Four of the respondents did not appear to have entered data in good faith: they were identified by the fact that all their responses were "Don't know" (the default value) or their response times were so short that it was apparent that they were not reading the assertions and giving considered responses. Their responses were removed from the set to be analysed. The default response, "Don't know", was treated as a valid response; however, such responses were not used for calculations. This is the cause of the smaller sample sizes reported in the analyses.

The volunteer respondents were asked for their feedback on the survey and were essentially positive about their participation in the study (see Table 6, below). There were (invited) criticisms of the survey instrument:
• "the layout & colors used here are visually offensive :) and the help looked too long so I didn't read it. Hope these comments help improve the survey in the future at least. Cheers!"
• "How about just asking what was wrong? The survey seemed ambiguous and repetitive."
These positive comments were also received in the feedback on the goals of the survey:
• "Yes, it worth !!"
• "I wish someone had performed a similar study *last* semester."
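The response-set screening rules described in this subsection (dropping single-response and constant-value respondents, and respondents who answered only with the default or implausibly quickly) are simple to express in code. The sketch below is illustrative only; the data structure and the two-second threshold are invented, not taken from the study:

def keep_respondent(responses, seconds_per_response, min_seconds=2.0):
    """responses: list of ratings, with None standing for the "Don't know" default.
    Return False for response sets excluded from the analysis."""
    answered = [r for r in responses if r is not None]
    if len(answered) <= 1:            # one response only, or nothing but "Don't know"
        return False
    if len(set(answered)) == 1:       # every substantive response had the same value
        return False
    if all(s < min_seconds for s in seconds_per_response):
        return False                  # too fast to have read and considered the assertions
    return True

print(keep_respondent([3, 3, 3, 3, 3], [5, 6, 4, 7, 5]))       # False: constant responses
print(keep_respondent([3, 7, None, 2, 9], [12, 8, 3, 9, 10]))  # True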
Table 6 summarises the sentiments of the respondents to participation in the survey. The 13-point response scale was used, with "1" indicating the most positive sentiment towards the survey and "13" the most negative.

Construct                                   N     Mean rating    Pilot mean rating    50% (median)
Able to express respondent's perceptions    35    6.3            6.2                  6
Easy to use                                 35    6.1            6.9                  6
Instructions easy to follow                 33    4.5            5.4                  4
Value for others                            34    4.7            4.1                  4
Value for self                              35    5.9            5.7                  6
Table 6 – Respondents' Feedback on the Survey

Comparison with the mean responses for the Pilot Study suggests that work on the instrument since the pilot had been effective in delivering an instrument that was easier to use, with better help facilities. While the request for participation had appealed to some students' altruism, it is clear that some of the volunteer respondents welcomed an opportunity to comment on the subject in general and on the academic staff. The overall response rate was disappointing, although from an ethics perspective it meant that the students clearly felt no compulsion to complete the assessment. It was not possible to determine whether there was systemic bias behind the non-response. However, one academic who knew the students commented: "I suspect you're going to get some very blank looks from the students. I believe they haven't come across CMMI yet and I suspect the vocabulary you used isn't quite what they're used to".

In order to provide a greater contrast, the 13-point scales were collapsed. Responses in the range 1-4 were categorised as "Positive", the range 5-9 as "Neutral" and the range 10-13 as "Negative". Respondent demographic variables were categorised as "No full-time work experience" or "Has full-time work experience", "Subject not useful" or "Subject useful", and "No managerial experience" or "Has managerial experience".

Construct Level Analysis
The items from the AIMQ model were grouped into their IQ Dimensions (or constructs). The associations at the construct level were then evaluated against the demographic variables. Significant (0.05) Pearson correlation coefficients were found between having had managerial experience and two constructs: "Ease of Operation" (-0.105) and "Relevancy" (-0.115), implying that the more experience people had, the easier to use and the more relevant they found the information. However, the correlation coefficients were small, suggesting little impact on the responses. Respondents with full-time work experience tended to be more positive about the quality of the information provided, possibly because they knew what to do with the information and what it was useful for.
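A minimal sketch of the collapsing and correlation step described above, assuming pandas and using entirely invented data (the study's dataset is not reproduced here):

import pandas as pd

# One row per respondent: a construct-level mean rating (1-13, low = positive)
# and a binary demographic flag. All values are invented for illustration.
df = pd.DataFrame({
    "ease_of_operation": [3, 5, 11, 2, 7, 12, 4, 6],
    "managerial_experience": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Collapse the 13-point scale into the three categories used in the paper.
df["ease_category"] = pd.cut(df["ease_of_operation"],
                             bins=[0, 4, 9, 13],
                             labels=["Positive", "Neutral", "Negative"])

# Pearson correlation between the demographic flag and the construct rating;
# a negative coefficient means experienced respondents gave lower (more
# positive) ratings, as reported for "Ease of Operation" and "Relevancy".
print(df["managerial_experience"].corr(df["ease_of_operation"]))
print(df["ease_category"].value_counts())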
Item Level Analysis
Associations between responses to individual test items and demographic variables were analysed using the Spearman (non-parametric) method. Seven cases were identified in which there is a significant association (