LESSONS LEARNED ON MEASURING PERCEIVED SOFTWARE QUALITY

Michalis Xenos
SUMMARY

This paper summarises work on measuring users’ opinions of the quality of software products, carried out over the last four years, and presents the experience gained from measuring users’ perception of software quality. Although the paper briefly presents the survey method and the techniques used to ensure the quality of the responses, it mainly emphasises the problems related to such measurements and the human factors that the measuring team has to overcome. It also presents examples from the measurement results and the underlying philosophy of such soft measurements. Finally, it describes how these soft measures are performed within the overall measurement approach and the adopted quality methodology.

Dr. Michalis Xenos, Computer Technology Institute, Research Unit II: Software Engineering and Applications, Unit Manager, Phone: +(30 61) 960.336, Fax: +(30 61) 997.783, e-mail: [email protected].

Proceedings of the FESMA99 International Conference, Federation of European Software Measurement Associations, Amsterdam, The Netherlands, pp. 349-356, 1999.
1. INTRODUCTION

This paper discusses the issue of ‘perceived software quality’. Normally, when discussing perception, we imply ‘customer perception’. Although the term ‘customer’ is commonly used in relation to quality in every other field, when discussing ‘software quality’ the term ‘user’ is the most commonly used. To respect this tradition, we use the term ‘user’ in this paper; therefore, this paper discusses ‘user perception of software quality’. When quality assurance techniques are applied to the manufacture of a product, it is clear who the customer is. In a software quality assurance programme, however, the ‘user’ is not so clearly defined. It is assumed that the user is the consumer who receives the software and uses it for business or leisure purposes, and this is correct. But there are more ‘users’ of the software. For example, both the testing and maintenance departments ‘use’ the software and expect it to meet a set of quality criteria. Freelancers ‘use’ their own software, since they usually play the roles of testers and maintainers. Companies ‘reuse’ parts of code and software components that they have either built previously or bought from subcontractors. This peculiarity distinguishes software from every other product since, within software production, almost everyone who participates in the production plays the role of the user at some level.
Therefore, ‘user perceived quality’ in software must be handled with an entirely different approach from ‘customer perceived quality’ in any other type of product manufacture. Due to this uniqueness, techniques used to measure customer satisfaction and product quality are not always easy to apply when measuring user perceived software quality. The techniques used to measure perceived quality of service [EVV, 94] are easier to adopt for user perceived quality measurements. However, there are more difficulties to overcome. Software evolves and changes rapidly, and user needs and user perception of quality change drastically over time. Surveys of user opinion need to be conducted very frequently and to offer accuracy in the measurement results. Furthermore, we need to be able to predict changes (revisions) in user perception. Although internal software metrics (also called ‘hard metrics’, as discussed in the following section) can provide hints about software quality, there is a need to conduct rigorous measurements of user opinion at every level of software production. The internal metrics (at least as they have evolved up to now) cannot be used effectively without simultaneously conducting user perceived quality measures. By no means should this be thought of as a paper against internal software metrics: their usefulness is indisputable, but they are more effective when used in conjunction with external metrics (also called ‘soft metrics’).

In the following section of this paper, the philosophy of soft measures and measurements is presented; how such measurements can be related to hard measurements is discussed, and how they are used to calibrate hard measurement tools is analysed. Additionally, the relations between soft and hard measures are presented, as well as the aim of using them. Moreover, the requirements for soft measures set by standards and awards are documented, and the problems encountered when conducting soft measurements are introduced to the reader. Furthermore, the company-wide use of such measures and the related problems are discussed. In the third section, an overview of the soft measurement techniques we have used over the last four years is presented. Finally, in section four of the paper, issues regarding customer reaction to measurements and lessons learned when collecting such soft measures are presented. Additionally, this section presents how these soft measures are performed within the overall measurement approach and the adopted quality methodology.
2. SOFT AND HARD MEASUREMENTS

Internal software measures, also called ‘hard measures’ [JON, 91], are measures of things that can be quantified with little or no subjectivity. For the hard data elements, high accuracy is both possible and desirable. Internal software measures are very frequently used in almost all software quality assurance programmes. External software measures of user perceived quality, also called ‘soft measures’, refer to elements for which human opinions must be evaluated. Since human opinions vary, absolute precision cannot be achieved for soft measures. Nevertheless, if external measures are taken rigorously, they will explain variations in project outcomes and offer valuable data to any quality assurance department. In our work, internal measures were collected using software metrics, while external measures were collected by conducting surveys. As argued in [KAP, 95], surveys allow focusing on just the issues of interest, since they offer complete control over the questions being asked. Furthermore, surveys are quantifiable and, therefore, are not only indicators in themselves but also allow the application of more sophisticated analysis techniques appropriate to organisations with higher levels of quality maturity.
Moreover, the use of external measures is strongly encouraged by international standards such as ISO 9001 [ISO, 91], the IEEE quality standards [IEEE, 89], the Capability Maturity Model [PAU, 93] and the Baldrige awards [BRO, 91]. Despite these factors, however, external measures are not frequently used, or their use is limited to customer satisfaction measures carried out by the marketing department, which aim at something entirely different and not at measuring user perceived software quality.

Both methods (internal and external measures) offer much to the quality assurance team, but may also present problems if used separately. Below, we present the advantages and disadvantages of internal and external measures. Internal measures are easy to collect and, in most cases, their collection is automated. The collection of such metric results is cost effective, and the results are easily analysed with statistical methods and tools. On the other hand, the metric results are difficult to interpret and to correlate with external quality characteristics. In most cases (if measures are conducted without the aid of external measures) there is a tendency to measure internal quantities with little relation to external quality characteristics. Another problem of internal measures, as we have shown [XEN, 96], is that they cannot detect all problematic software elements. External measures, in contrast, directly measure the desired external product quality characteristics. Furthermore, their use is based on the definition of quality (satisfied users) and is encouraged by international standards and awards. On the other hand, external measures are neither objective nor cost effective. Moreover, the measurements are difficult to analyse due to high error rates and to the use of various data types. Methods to deal effectively with most problems of external measurements have been proposed and used [XEN, 97].

The conclusion that can be drawn from the above discussion is that internal measures provide an easy and inexpensive way to detect and correct possible causes of low product quality as perceived by the users. Setting up measurement programmes and metric standards will help prevent failures in satisfying users’ demand for quality. However, satisfying internal quality measures is not an a priori guarantee of success in fulfilling this demand: programs that succeed in the internal measures may not receive the same acknowledgement from users. Our suggestion is that external measures should be used in parallel with internal measures in order to, firstly, ensure the accuracy of internal measures, secondly, detect problematic software elements that have passed the internal metrics and, lastly, measure external quality characteristics directly. Furthermore, the deployment of external measures can be used to test the soundness of the internal measures and, occasionally, even to calibrate internal metrics. Quality assurance teams must never forget that, despite what internal measurements indicate, the final judge of the quality of the produced software is the user.
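To make this parallel use of internal and external measures concrete, the following minimal sketch (ours, not taken from the paper; the module scores and user ratings are invented for illustration) checks how well an internal metric score agrees with survey-based ratings of the same modules:

# Minimal sketch (not the paper's actual tooling): testing whether an
# internal ("hard") metric agrees with external ("soft") survey scores.
# The metric values and survey averages below are invented purely for
# illustration.

from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sdx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sdy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sdx * sdy)

# Hypothetical data: an internal quality score per module (e.g. derived
# from complexity and size metrics, higher = better) and the average user
# rating of the corresponding functionality from a survey (1-5 scale).
internal_score = [0.82, 0.64, 0.91, 0.55, 0.73]
user_rating    = [4.1,  3.2,  4.5,  2.6,  3.8]

r = pearson(internal_score, user_rating)
print(f"internal vs. perceived quality: r = {r:.2f}")

# A weak or negative r would suggest that the internal metrics programme
# needs recalibration against what users actually perceive, as argued above.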
3. MEASURING USER PERCEIVED QUALITY

External measurements of user perceived quality were conducted with surveys. As mentioned in the previous section, surveys have to overcome three major problems: (1) the subjectivity of the answers, (2) the inability to weigh users’ opinions according to their qualifications and (3) errors in the users’ choices of responses. The subjectivity of the answers will always be a problem with surveys. However, many techniques can help the user express his or her opinion using predefined measurement scales and multiple-choice answers related to specific and soundly defined quality characteristics.
The adoption and application of simple rules when planning the survey and designing the questionnaire will improve the correctness of the measurements. The quality engineer who sets up a survey using questionnaires must follow guidelines [XEN, 97] on how to structure the questionnaire formally, in order to minimise subjectivity due to varying interpretations of questions or choice levels.

In many cases (for example, when measuring the perceived quality of a pre-released product or component within a company) it is not correct to weigh all user opinions equally. Simply averaging survey data does not take into account the significance of each user’s opinion. Therefore, there is a need for techniques that weigh users’ opinions according to their qualifications. The techniques we have proposed and used [XEN, 94] weigh each user’s opinion based on his or her qualifications.

The major problem with surveys is the frequency of errors. Due to the nature of surveys, incorrect responses will occur. Such incorrect responses, which do not represent the user’s actual opinion, are called errors. In the surveys we conducted over the last four years, we measured a significant number of such errors, caused by various factors that might seem extreme but do occur. Such errors can be reduced by following the simple rules presented previously, but they cannot be entirely eliminated. Since we cannot prevent errors, we have used a set of techniques to detect them. In our external measurement surveys, we have used safeguards [XEN, 95] to detect errors. Safeguards are questions placed inside the questionnaire to measure the correctness of responses; they are not aimed at measuring user perceived quality, but are control questions for detecting errors. Finally, when surveying users’ opinions of a product that changes as drastically as software, we need a method [STA, 99] to predict belief revisions in users’ opinions. Such a method ensures that we do not need to conduct surveys as frequently as before, since users revise their requirements rapidly as new software products are released, and this affects their opinion of the measured product. In the following section, we discuss lessons learned while trying to apply all these theories in practice.
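As an illustration of how qualification weighing and safeguards combine, the sketch below is a simplified stand-in for the techniques of [XEN, 94] and [XEN, 95]: the 1-5 rating scale, the weight values and the ‘negated restatement’ safeguard rule are assumptions made for this example, not details taken from those papers.

# Illustrative sketch only: the actual QWCO technique and safeguard design
# are described in [XEN, 94] and [XEN, 95]; the weights, scale and the
# consistency rule below are assumptions made for this example.

# Each response: a qualification weight (e.g. derived from experience),
# a rating of some quality characteristic (1-5), and the answers to a
# safeguard pair - the same question asked twice, once negated, so the
# two answers should roughly mirror each other (a + b close to 6).
responses = [
    {"weight": 1.0, "rating": 4, "safeguard": (4, 2)},
    {"weight": 0.5, "rating": 2, "safeguard": (2, 4)},
    {"weight": 0.8, "rating": 5, "safeguard": (5, 5)},  # contradictory -> error
]

def consistent(pair, tolerance=1):
    """Flag a response as an error if its safeguard pair contradicts itself."""
    a, b = pair
    return abs((a + b) - 6) <= tolerance

valid = [r for r in responses if consistent(r["safeguard"])]

# Qualification-weighted average of the remaining opinions, instead of a
# plain mean that would treat a novice and an expert user identically.
score = sum(r["weight"] * r["rating"] for r in valid) / sum(r["weight"] for r in valid)
print(f"{len(responses) - len(valid)} responses rejected; weighted score = {score:.2f}")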
4. CUSTOMER REACTION TO MEASUREMENTS

4.1 Frequency of external measurements

Over the last four years, we have conducted a large number of surveys measuring perceived software quality. Many of these surveys were carried out in order to collect external measurements to be used in our software projects, and a few were carried out purely for research purposes. The research surveys aimed at collecting data on error rates, user motivation, etc., and were not used within a software project. Since some of the measurements were conducted before the final release of the software (mostly measuring perceived quality in pre-releases, or beta versions), the results were used to improve the quality of the final software product. In most cases, simply carrying out a survey makes the users feel that the company cares about and respects their opinion. In a project in which we developed a storage system for a political party, one user said: “…by asking my opinion you make me feel important, like I am working on your team to make this program better…”. A user who feels this way will always be glad to help in any survey and to respond thoughtfully without making many errors.
The responses collected from highly motivated users were the ones with the lowest error rates. On the other hand, the questionnaires used for the survey must be designed to make the user feel this way. There is always a risk that the user will feel that the aim of the questionnaire is not to measure perceived quality but to test his or her skills in using the software. In cases where we had recently installed a program and measured the opinions of users who were still in training, we discovered that we had to be extremely careful not to make the users think that the survey was a hidden method of measuring their learning progress. There is also the danger of overusing surveys and making the users tired of the process. The use of belief revision prediction techniques helped us reduce the frequency of the surveys. External measurements carried out many times with the same users were found to contain a high number of errors, since the users had lost their initial enthusiasm. Our suggestion is to measure perceived software quality in a manner that makes the users feel that their opinion is important, but not to repeat such measurements too many times.
4.2 Survey method

There is a heated debate as to which type of survey is best to use. Interviews guarantee a large number of responses and a low error rate (since the interviewer is always there to assist), but they carry the risk that the responses will be affected by the interviewer’s opinions, especially when the interviewer is a member of the company that created the software and, therefore, has high expectations of the perceived software quality measures. In our projects, we used questionnaires and a form of the so-called mail survey. We found that in a mail survey (an e-mail survey in most cases) we tended to receive responses from only 10% to 20% of the persons asked, and these responses still had a high error rate. Therefore, when applicable, we decided to send a person to hand out the questionnaires and collect the responses, with specific instructions not to interfere with the measurement process and not to offer any assistance to the users on how to respond. This person was responsible only for ensuring that all users spent time reading the questionnaire and replying themselves, and for collecting the responses and bringing them back.
4.3 Measuring users’ qualifications

In section 3 of this paper, we discussed techniques we have used to take into account the significance of each user’s opinion and to weigh users’ opinions according to their qualifications. Using such a technique (namely QWCO, ‘Qualifications Weighed Customers Opinion’), we need to measure not only users’ opinions but also their qualifications. When measuring users’ qualifications, human factors affect the results. People tend to exaggerate when presenting their qualifications, especially when they feel that the truth will look negative on the response sheet. The choice of words describing the predefined choices, and the way these choices are rated, is very important. For example, during a survey carried out entirely for research purposes (in a project for educational software), we asked the same question about years of experience with a specific software product twice. Once using the choices: (1) less than one year, (2) one to three years, (3) four to five years, (4) six to eight years, (5) over eight years; and once (in a different part of the questionnaire, just rephrasing the question) using the choices: (1) less than a month, (2) one to three months, (3) four months to half a year, (4) half a year to a year, (5) over a year. In the first question, the vast majority of the responses was not the first choice, although almost none of the responses to the second question was the fifth choice!
People never feel comfortable admitting that they fit in the lowest category when talking about their experience or qualifications. Therefore, if you expect a large number of people to be in the ‘less than a year’ category, it is always better to divide this category into three new ones, even if you merge them again when analysing the results. This problem is even more pronounced when measuring qualifications in a survey inside the company. There is always the fear that people might think the survey is a method to gauge their skills or their progress, or to judge them, and they tend to be defensive or negative towards such surveys. The quality department must clarify the purpose of the survey and guarantee anonymity, and the questionnaire must make clear that its only purpose is to measure perceived software quality.
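A hypothetical illustration of this splitting-and-merging tactic (the category labels below are invented, not those of the survey described above): the questionnaire offers fine-grained low-end choices so that nobody has to tick a single bottom category, and the analysis collapses them back into one.

# Hypothetical illustration of the splitting tactic described above; the
# category labels are assumptions made for this example.

from collections import Counter

FINE_TO_COARSE = {
    "less than 3 months": "less than a year",
    "3 to 6 months":      "less than a year",
    "6 to 12 months":     "less than a year",
    "1 to 3 years":       "1 to 3 years",
    "over 3 years":       "over 3 years",
}

answers = ["3 to 6 months", "6 to 12 months", "1 to 3 years", "less than 3 months"]
merged = Counter(FINE_TO_COARSE[a] for a in answers)
print(merged)  # Counter({'less than a year': 3, '1 to 3 years': 1})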
4.4 Users’ belief revision

Changes in users’ opinions are frequent when measuring perceived software quality. This is mostly observed when a product is used by a number of inexperienced users. Such users form a stable opinion only after using the software product for a long period of time. The length of this period depends on the complexity of the product, the number and variety of the functions it supports, the amount of usage and the conditions under which usage occurs, as well as the users’ familiarity with similar software products. This period, according to our measurements, usually varies from six to twelve months. During this period, a user’s opinion might change drastically. In a project in which we developed a bibliographic database system, a user was initially so excited by the new facilities this product brought to his work that he perceived the quality of the software as quite good. After a short period of time, he thought the quality of the software was not that good: he was so overwhelmed by the functions the software offered that he expected miracles from it (for example, to provide information about fields that were not even stored in the database). Finally, after almost a year, and after he had realised the true potential the technology could offer, he formed a solid opinion about the software’s quality. In order to avoid repeatedly measuring such users’ opinions before they form their final opinion, we have used techniques from belief revision aimed at predicting where their final opinion will settle and when the best time is to conduct the next external measurements.
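The actual prediction method is the subject of [STA, 99] and is not detailed in this paper. Purely as a toy stand-in for the idea, one could treat an opinion as settled once it stops moving between successive measurement rounds, and schedule the next full survey only after that point; the tolerance and the rating trajectory below are assumptions.

# Toy stand-in for the idea above - NOT the method of [STA, 99]: treat a
# user's opinion as settled once the change between successive survey
# rounds stays within a tolerance, and only schedule the next full
# measurement after that point.

def has_settled(ratings, tolerance=0.25, stable_rounds=2):
    """ratings: chronological opinion scores from repeated mini-surveys."""
    deltas = [abs(b - a) for a, b in zip(ratings, ratings[1:])]
    recent = deltas[-stable_rounds:]
    return len(recent) == stable_rounds and all(d <= tolerance for d in recent)

# Hypothetical trajectory mirroring the bibliographic-database user:
# initial enthusiasm, disappointment, then a stable final opinion.
monthly_ratings = [4.8, 3.1, 2.9, 3.6, 3.7, 3.7]
print(has_settled(monthly_ratings))  # True: the last two changes are small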
4.5 Analysing the results

The important issue in surveys is not the collection of data, but the results derived from their analysis. We have used the external measurements of perceived software quality in parallel with our internal metrics programme. Our research shows that projects that fail in the internal metrics, in the vast majority of cases, also fail in the user perceived quality measures. On the other hand, in many cases a good result in the internal metrics does not guarantee success in the external metrics. Therefore, the deployment of user perceived quality measurements can be used to test the soundness of the internal metrics programme and, occasionally, even to calibrate internal metrics (aiding modifications of the quality manual). Quality assurance teams should never forget that, despite what internal measurements indicate, the final judge of the quality of the produced software is the one who uses it.
5. CONCLUSION

Measurements of perceived software quality are important to the quality department of any company. Such measures provide direct information related to software quality characteristics and can lead to changes in software production and to error prevention. Furthermore, such measurements aid in calibrating the internal measurement process and in testing the validity of the internal metrics. However, for such measurements to be effective, they must be carried out rigorously, by expert and trained personnel, supported by a solid theoretical background, with predefined and clear measurement goals and with the same methodical approach as the internal measurements.
REFERENCES

[BRO, 91] Brown, M G, ‘Baldrige Award Winning Quality: How to Interpret the Malcolm Baldrige Award Criteria’, Milwaukee, WI: ASQC Quality Press, 1991.
[EVV, 94] Edvardsson, B, Thomasson, B, and Ovretveit, J, ‘Quality of Service: Making it Really Work’, McGraw Hill, isbn: 0-07-707949-3, 1994.
[FEN, 97] Fenton, N, and Pfleeger, S, ‘Software Metrics: A Rigorous & Practical Approach’, Second Edition, Thomson Computer Press, isbn: 1-85032-275-9, 1997.
[IEEE, 89] IEEE, ‘Standard for a Software Quality Metrics Methodology’, P1061/D20, IEEE Press, New York, 1989.
[ISO, 91] ISO, ‘Quality Management and Quality Assurance Standards’, International Standard, ISO/IEC 9001, 1991.
[JON, 91] Jones, C, ‘Applied Software Measurement: Assuring Productivity and Quality’, McGraw Hill, isbn: 0-07-032813-7, 1991.
[KAP, 95] Kaplan, C, Clark, R, and Tang, V, ‘Secrets of Software Quality’, McGraw Hill, isbn: 0-07-911795-3, 1995.
[LAH, 92] Lahlou, S, Van der Maijden, R, Messu, M, Poquet, G, and Prakke, F, ‘A Guideline for Survey Techniques in Evaluation of Research’, Brussels, ECSC-EEC-EAEC, 1992.
[PAU, 93] Paulk, M, Curtis, B, Chrissis, M, and Weber, C, ‘Capability Maturity Model for Software’, Software Engineering Institute, CMU/SEI-93-TR-024, 1993.
[STA, 99] Stavrinoudis, D, Xenos, M, Peppas, P, and Christodoulakis, D, ‘Measuring User's Perception and Opinion of Software Quality’, 6th European Conference on Software Quality, EOQ-SC, Vienna, pp. 229-237, 1999.
[XEN, 94] Xenos, M, and Christodoulakis, D, ‘Software Quality: The user's point of view’, International Conference on Software Quality and Productivity, Hong Kong, sponsored by IFIP, published by Chapman and Hall in ‘Software Quality and Productivity: Theory, practice, education and training’, edited by Matthew Lee, Ben-Zion Barta and Peter Juliff, isbn: 0-412-62960-7, pp. 266-272, 1994.
[XEN, 95] Xenos, M, and Christodoulakis, D, ‘Evaluating Software Quality by the Use of User Satisfaction Measurements’, 4th Software Quality Conference, SET, University of Abertay, Dundee, pp. 181-188, 1995.
[XEN, 96] Xenos, M, Stavrinoudis, D, and Christodoulakis, D, ‘The Correlation Between Developer-oriented and User-oriented Software Quality Measurements (A Case Study)’, 5th European Conference on Software Quality, EOQ-SC, Dublin, pp. 267-275, 1996.
[XEN, 97] Xenos, M, and Christodoulakis, D, ‘Measuring Perceived Software Quality’, Information and Software Technology Journal, Butterworth Publications UK, Vol. 39, pp. 417-424, 1997.