Toward a Calculus of Confidence

Christopher Scaffidi
Institute for Software Research
School of Computer Science
Carnegie Mellon University
[email protected]

Mary Shaw
Institute for Software Research
School of Computer Science
Carnegie Mellon University
[email protected]

ABSTRACT
Programmers, and end-user programmers in particular, often have difficulty evaluating software, data, and communication components for reuse in new software systems, which effectively reduces the value programmers derive from those components. End-user programmers are especially ill equipped to exercise the customary high-ceremony means of evaluating software quality. We seek effective ways to use low-ceremony sources of evidence, such as online reviews and reputation data, to make components’ quality attributes easier to establish, thereby facilitating more effective selection of components for reuse. Achieving this will require identifying sources of low-ceremony evidence, designing the meta-information required to track the differing sources and levels of credibility of various sources of evidence, and developing a method for combining pieces of disparate information into overall estimates of component value.

1. CHALLENGES TO REUSE OF COMPONENTS
Many programming tools exist for creating, manipulating, and integrating spreadsheets, web pages, email servers, XML data feeds, databases, and other software components. In many cases, the intended user population is the “end-user programmer,” a person who has enough skill to create simple software but who is not a professional software developer [15]. Recently, there has been a broad initiative to help end users improve the quality of the systems that they create from parts, essentially transitioning users from “programmers” to “software engineers.” (For example, the EUSES consortium [7] has published extensively in this area.)

A key trait of software engineering, as distinguished from simply programming, is careful selection of components with desired quality attributes. Compared to simply writing every program from scratch, proper reuse of components can lead to higher quality attributes in the resulting product [1][3] and higher productivity by the programmer. Unfortunately, the attributes of components are rarely transparent. As a result, end-user programmers often struggle to select appropriate components, to coordinate the use of those components, and to understand why components behave in a certain manner [10]. Even professional programmers are sometimes surprised by carefully selected components’ functional and extra-functional attributes [8][16]. When components’ software attributes are hard to predict due to a lack of usable specifications or documentation, then the effective value of those components is significantly reduced. Our goal is to extend the range of evidence that can be effectively used to evaluate components.

2. THE ROLE OF EVIDENCE
Enabling programmers to better tap the value of components will involve developing a method and supporting model for reasoning about specific, situated evidence and combining that evidence in order to make judgments about a variety of attributes of a software component or system. For example, we aim to synthesize multiple estimates of a component’s reliability into a useful reliability estimate; this will enable a programmer to compare components on the basis of their reliability.

Evidence about software attributes comes from different sources. Four have gained widespread acceptance among computer scientists [17]: formal verification, code generation by a trusted automatic generator, systematic testing, and careful empirical studies of the software in operation. We consider these “high-ceremony” sources of evidence because, like high-ceremony software development processes [2], these sources of evidence require precise specifications and substantial investment of effort. High-ceremony sources of evidence may be unavailable, and end users may find some of them hard to interpret.

In actual practice, programmers often rely on more subjective, imprecise, or unreliable “low-ceremony” sources of evidence when selecting software components. Potential sources of low-ceremony evidence include:
● reviews of components in professional journals (e.g., [6]), which recommend certain components for certain contexts, essentially playing the role of a Consumer Reports [5] for professional software developers
● third-party reviews of vendors and products by users (e.g., amazon.com)
● recommendations by co-workers or friends; popularity
● qualitative reasoning [11]
● advertising claims by vendors
● branding or seller reputation
● certification based on subjective criteria
● “best X” reports, often based on linear functions of subjective marks, such as “best schools” or “best doctors”
● aggregation of group opinion, obtained statistically or through auction, betting, and other financially-inspired mechanisms for tapping the “wisdom of crowds” [18]
Not all pieces of evidence about a software attribute are equally credible; this is especially true of low-ceremony evidence. Each piece of evidence itself has attributes that determine its overall credibility. For example, evidence can be more or less objective, and it can be more or less relevant to a particular context. Evidence may change dynamically, either because it becomes available over time or because attributes actually change, so evidence’s timeliness can affect credibility. Each piece of evidence may be incomplete, either because information is unavailable or because it is expensive to collect. (Other attributes of the evidence may not affect confidence but may affect the interpretation of evidence. For example, different pieces of evidence may be expressed in different units, different scales, or even different kinds of expressions [12].)

Low credibility of evidence suggests that the information provided by the evidence might be inaccurate. Thus, low credibility weakens any explicit, intrinsic confidence bounds that the evidence has around the estimate, yielding an adjusted confidence bound. For example, a testing company might determine that a system has an availability of 99.9%, with a confidence bound of ±0.1%; however, if the company happens to be a subsidiary of the system vendor, then the objectivity of the estimate might be called into question, suggesting that the system’s true availability is somewhat lower (but probably not higher). As a result, the resulting adjusted confidence bound on availability might be a range from 90% to 99.99%.

Note that credibility and intrinsic confidence bounds on software attributes are typically not statistical in nature. This contrasts sharply with the physical sciences, in which, for example, a measurement of the electron’s charge would have an explicit statistical confidence bound arising from the intrinsic sensitivity of the equipment used in generating the evidence.
Thus, it is often difficult to use statistics to combine measurement error and credibility into a confidence bound around one estimate of a software attribute.

After adjusting any intrinsic confidence bounds to reflect estimates’ credibility, synthesizing estimates of a value into a single estimate requires giving more weight to evidence with tighter adjusted confidence bounds than to evidence with looser adjusted confidence bounds. However, this weighting might not be as simple as computing a weighted average, as the evidence may not be mathematically simple scalars or well-behaved functions. For more complex situations, a more sophisticated model will be required.

For example, if three testing labs estimate that a component’s availability is 90%, 95%, and 97%, respectively, with equal confidence bounds and credibility for each estimate, then a reasonable synthesis simply averages them with equal weight (yielding an estimate of 94%). As a second example, suppose that the first lab claims a confidence bound of ±5% while the other two claim a confidence bound of ±1%, though the second lab performed its tests on an outdated version of the component. Then the first estimate has a relatively loose intrinsic confidence bound, and the second estimate has relatively weak credibility, so each of these has a looser adjusted confidence bound than does the third estimate. Consequently, the third estimate (97%) should receive the most weight, and the synthesis of these three estimates will be higher than the first synthesized estimate (94%). At present, however, it is not clear how much higher the second synthesized estimate should be, compared to the first. The reason is that no well-validated method exists for incorporating credibility into confidence bounds and the overall process of synthesizing estimates. We intend to develop such a method, which we refer to as a “calculus of confidence.”

3. APPROACH AND EVALUATION
Although existing research has not yet integrated credibility into synthesis of estimates, prior work has developed methods for incorporating statistical confidence bounds. These methods of using statistical confidence will serve as candidate starting points for our calculus of confidence.
● If estimates are modeled as Gaussian probability distribution functions (PDFs), then estimates can be averaged, with each weight inversely proportional to the respective distribution’s variance [19].
● If estimates are modeled as beta PDFs, then they can be easily combined using Bayesian analysis [19].
● If estimates are modeled as discrete PDFs, then the consensus of estimates can be identified through data compression, using information theory algorithms [9].
● If estimates are modeled as ranges, and the user wants to know the best and worst cases, then the minimum and maximum can be computed. If the user needs to bound the reasonably likely best and worst cases (a common practice in disaster management and software security [4]), then the 10%/90% cases can be identified.
● If the estimates are modeled as constraints, then they can be conjoined.
● If the user has preferences about kinds of evidence to trust, then these can be reflected in the synthesis. For example, the user might prefer to trust the least self-interested source, or might prefer to trust the most recent measurement of a regularly changing value.
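To make three of the candidate starting points above concrete, the following minimal Python sketch shows inverse-variance averaging of Gaussian estimates, the minimum/maximum envelope of range estimates, and the conjunction of range constraints. It is an illustration under stated assumptions, not the calculus of confidence itself: the names `Estimate`, `half_width`, `inverse_variance_mean`, `envelope`, and `conjoin` are hypothetical, and reading an adjusted confidence bound as one standard deviation is an assumption made only so that the Gaussian rule can be applied.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

Range = Tuple[float, float]  # (low, high), e.g. availability in percent


@dataclass(frozen=True)
class Estimate:
    """One estimate of a software attribute (hypothetical helper type)."""
    value: float       # point estimate, e.g. 95.0 for 95% availability
    half_width: float  # adjusted confidence bound, read as one std. dev. (assumption)


def inverse_variance_mean(estimates: Iterable[Estimate]) -> float:
    """Average Gaussian estimates, weighting each by 1 / variance."""
    items = list(estimates)
    weights = [1.0 / (e.half_width ** 2) for e in items]
    return sum(w * e.value for w, e in zip(weights, items)) / sum(weights)


def envelope(ranges: Iterable[Range]) -> Range:
    """Best and worst case over range estimates: overall minimum and maximum."""
    items = list(ranges)
    return min(lo for lo, _ in items), max(hi for _, hi in items)


def conjoin(ranges: Iterable[Range]) -> Optional[Range]:
    """Treat each range as a constraint and intersect them.

    Returns None when the constraints cannot all hold at once.
    """
    items = list(ranges)
    lo = max(low for low, _ in items)
    hi = min(high for _, high in items)
    return (lo, hi) if lo <= hi else None


if __name__ == "__main__":
    # Three labs with equal bounds: the weights are equal, so this is a plain average.
    equal = [Estimate(90.0, 1.0), Estimate(95.0, 1.0), Estimate(97.0, 1.0)]
    print(inverse_variance_mean(equal))  # 94.0

    # Second scenario from Section 2. The widening of the second lab's bound
    # from 1.0 to 3.0 is an invented placeholder for its weaker credibility.
    adjusted = [Estimate(90.0, 5.0), Estimate(95.0, 3.0), Estimate(97.0, 1.0)]
    print(inverse_variance_mean(adjusted))
```

With the placeholder bounds in the second scenario, the weighted result lands between 96% and 97%, above the equal-weight 94%, which agrees with the direction argued in Section 2. How far above depends entirely on the assumed widening for the second lab, which is precisely the quantity that no validated method currently supplies.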
Rather than simply extending an existing statistical method to accommodate non-statistical confidence bounds, we will use a data-driven approach. Specifically, we will study the different types of evidence that users synthesize, and then we will evaluate these existing methods as a basis for designing our calculus of confidence.
We will begin by identifying a representative set of scenarios that exemplify component-selection decisions. To find these scenarios, we will review various studies of end-user programmers (such as [13], [14], and [20]). These scenarios will not be complete in the sense of identifying all possible kinds of component-selection processes, but they will be helpful for guiding our modeling of decision-making.

We will then analyze these scenarios to identify what sources of evidence help programmers to make wise selections of components. In addition, we will examine the data-combination challenges that arise in those scenarios, particularly when the best available evidence is from low-ceremony sources. We will determine what attributes of this evidence are responsible for determining the credibility of the evidence and for shaping any implicit confidence bounds on estimates of the software attributes.

Finally, we will evaluate statistical methods like those above by examining how well they model decision-making in our set of scenarios. The mismatch between these models and real component-selection decisions will guide development of a calculus of confidence, which we will evaluate more rigorously in future investigations.

4. CONCLUSION
In situations that do not involve critical dependability, the cost and difficulty of analyzing software based solely on high-ceremony evidence may be beyond the reach of many software developers, especially end-user programmers. Indeed, many users already rely heavily, though perhaps implicitly, on low-ceremony evidence, and professional programmers often consider such evidence as well. We are seeking to make the use of low-ceremony evidence more systematic and to make its limitations easier to reason about. We believe that providing assistance to people who use such data will improve their ability to evaluate software and components more than will exhorting them to restrict themselves to high-ceremony evidence.

5. ACKNOWLEDGMENTS
This work was funded in part by the National Science Foundation (ITR-0325273) via the EUSES Consortium and by the National Science Foundation under Grants CCF0438929 and NSF-CNS-0613823. We thank the participants of Dagstuhl Seminar 07031 on Software Dependability Engineering, especially Mikael Lindvall and John McGregor, for discussions that led to refinement of the ideas. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

6. REFERENCES
[1] V. Basili, L. Briand, W. Melo. How Reuse Influences Productivity in Object-Oriented Systems. Comm. ACM, 39, 10 (1996), 104-116.
[2] G. Booch. Developing the Future. Comm. ACM, 44, 3 (2001), 118-121.
[3] M. Burnett, C. Cook, G. Rothermel. End-User Software Engineering. Comm. ACM, 47, 9 (2004), 53-58.
[4] S. Butler. Security Attribute Evaluation Method: A Cost-Benefit Approach. Proc. 24th Intl. Conf. Soft. Eng., 232-240.
[5] Consumer Reports, ISSN 0010-7174, published by Consumers Union, http://www.consumerreports.org
[6] Dr. Dobb's Journal, ISSN 1044-789X, published by CMP Media LLC, http://www.ddj.com
[7] EUSES Consortium, http://eusesconsortium.org
[8] D. Garlan, R. Allen, J. Ockerbloom. Architectural Mismatch or Why It's Hard To Build Systems out of Existing Parts. Proc. 17th Intl. Conf. Soft. Eng., 1995, 179-185.
[9] J. Kelly. A New Interpretation of Information Rate. The Bell System Technical Journal, 35 (July 1956), 917-926.
[10] A. Ko, B. Myers, H. Aung. Six Learning Barriers in End-User Programming Systems. Proc. Symp. Vis. Lang. and Human-Centric Computing, 2004, 199-206.
[11] J. McGregor, T. Inn. A Qualitative Approach to Dependability Engineering. Dagstuhl Seminar 07031 on Software Dependability Engineering, January 2007.
[12] V. Poladian, S. Butler, M. Shaw, D. Garlan. Time Is Not Money: The Case for Multi-Dimensional Accounting in Value-Based Software Engineering. 5th Workshop on Economics-Driven Soft. Research, 2003, 19-24.
[13] M. B. Rosson, J. Ballin, J. Rode. Who, What, and How: A Survey of Informal and Professional Web Developers. Proc. Symp. Vis. Lang. and Human-Centric Computing, 2005, 199-206.
[14] C. Scaffidi, A. Ko, B. Myers, M. Shaw. Dimensions Characterizing Programming Feature Usage by Information Workers. Proc. Symp. Vis. Lang. and Human-Centric Computing, 2006, 59-62.
[15] C. Scaffidi, M. Shaw, B. Myers. Estimating the Numbers of End Users and End User Programmers. Proc. Symp. Vis. Lang. and Human-Centric Computing, 2005, 207-214.
[16] M. Shaw. Truth Vs Knowledge: The Difference Between What a Component Does and What We Know It Does. Proc. 8th Intl. Workshop on Soft. Spec. and Design, 1996, 181-185.
[17] M. Shaw. Writing Good Software Engineering Research Papers. Proc. 25th Intl. Conf. Soft. Eng., 2003, 726-736.
[18] J. Surowiecki. The Wisdom of Crowds. Anchor, 2005.
[19] D. Wackerly, W. Mendenhall, R. Scheaffer. Mathematical Statistics with Applications. Duxbury Press, 2001.
[20] S. Wiedenbeck. Facilitators and Inhibitors of End-User Development by Teachers in a School Environment. Proc. Symp. Vis. Lang. and Human-Centric Computing, 2005, 215-222.