Theoretical Issues in Ergonomics Science 2010, 1–15, iFirst
Measurement for evaluating the learnability and resilience of methods of cognitive work

Robert R. Hoffman a,*, Morris Marx a, Raid Amin b and Patricia L. McDermott c

a Institute for Human & Machine Cognition, 40 South Alcaniz St., Pensacola, FL 32502–6008, USA; b Department of Mathematics and Statistics, University of West Florida, 11000 University Parkway, Pensacola, FL 32514–5750, USA; c Alion Science Micro Analysis and Design Operation, 4949 Pearl East Circle, Suite 300, Boulder, CO 80301, USA

(Received 7 November 2008; final version received 2 October 2009)

* Corresponding author. Email: [email protected]
Some experiments on human–computer interaction are aimed at evaluating hypotheses concerning cognitive work. Other experiments are intended to evaluate the software tools that shape the cognitive work. In both cases, effective experimentation is premised on the control and factorial analysis of sources of variability. This entails programmes of experimentation. However, sociotechnical systems are generally a ‘moving target’ in terms of the pace of change. The objective of this study was to create a general approach to experimental design and the measurement of cognitive work that can satisfy the requirements for experimentation and yet can also provide a ‘fast track’ to the evaluation of software-supported cognitive work. A measure called ī (‘i-bar’) is presented, which is the inverse of the mid-range. The statistic is derived from data on trials-to-criterion in tasks that require practice and learning. This single measure is interpreted as a conjoint measurement scale, permitting: (a) evaluation of the sensitivity of the principal performance measure (which is used to set the metric for trials to criterion); (b) evaluation of the learnability of the work method (i.e. the goodness of the software tool); (c) evaluation of the resilience of the work method. It is shown that it is possible to mathematically model such order statistics and derive methods for estimating likelihoods. This involves novel ways of thinking about statistical analysis for discrete non-Gaussian distributions. The idea and method presented herein should be applicable to the study of the effects of any training or intervention, including software interventions designed to improve legacy work methods and interventions that involve creating entirely new cognitive work systems.

Keywords: cognitive work; performance measurement; learnability; range statistics; technology evaluation; resilience
1. Background

When intelligent technology is introduced into the sociotechnical workplace, it represents hypotheses about how the cognitive work is expected to change (Woods 1998). Explicitly, the hypothesis is that the change will be for the better: performance will be more effective and decisions will be improved. However, experience shows that intervention can
induce states of negative hedonicity, which is frustration due to user-hostile aspects of the interface and surprise when the automation does things the worker does not understand (Koopman and Hoffman 2003, Hancock et al. 2005). During the process of procuring information technology, many events occur that result in technologies that are not human-centred – that are not usable, useful or understandable. For instance, end-user involvement may have been inadequate to achieve a positive impact. Or the usability evaluation might have depended solely on a subjective questionnaire. The failure to achieve human-centring results in a gap between the work that people have to conduct because of the work model of the software vs. the true work that they need to perform in order to achieve their primary task goals (Vicente 1999). Tell-tale signs are user-created kluges and work-arounds (Laplante et al. 2007). The notorious frustrations and failures triggered by software interventions (Goguen 1997, Hoffman and Elm 2006, Neville et al. 2008) have led to a significant concern with evaluation in the software engineering and software human factors communities. Considerable attention is being given to usability engineering methods (e.g. Grudin 1992, Nielsen 1993, Shneiderman 1998, Rosson and Carroll 2002) and metrics for ‘key performance indicators’ (e.g. Schaffer 2004, O’Neill 2007) that assess work in terms of efficiency and effectiveness. Hoffman et al. (2008) suggested a simple and direct means of measuring negative hedonicity. This article discusses another challenge for metrics, and suggests a class of solutions.
2. Some challenges for the research in ‘research and development’

Research on human–computer interaction is often intended to support the evaluation of hypotheses concerning the nature of the cognitive work (e.g. the effects of distance collaboration on team performance) but at the same time research must be conducted to evaluate the new software tools themselves. New technologies might be thought of as part of the ‘materials and procedure’ that comprise the method of an experiment on human–computer interaction. But the new technologies must themselves be evaluated for effectiveness as a component within a cognitive work system. A number of problems are entailed by this challenge.
2.1. The moving target problem

According to the Moving Target Law of Human-Centered Computing (Hoffman and Woods 2005, Ballas 2007), the sociotechnical workplace is constantly changing, and change in environmental constraints (the workplace and its technologies) will entail change in cognitive constraints (e.g. how decision making is helped or hampered). Complex cognitive work mixes legacy technologies and work methods with new ones, addressing both old problems and emerging challenges. By the time a detailed task analysis (e.g. hierarchical task decomposition; see Shepherd 2001) has been conducted, the work will almost certainly have changed. Thus, the more detailed the decomposition, the more likely it is to become obsolete quickly. This results in what is called ‘the fundamental disconnect’: the time frame for effective programmatic experimentation on hypotheses about cognitive work cannot match the pace of change in the work and the technology of sociotechnical systems.
2.2. The baseline problem

In some cases, new technologies (decision support systems, interfaces, etc.) are designed to supplant old or existing technologies and the cognitive work is to remain essentially the same. There might be an empirical base against which to compare performance using current/older technologies vs. new technologies. But today, technologists (and workers) face situations in which new technologies are designed to support new work methods, to facilitate coping with entirely new and emerging situations. Examples would be ‘net-centric intelligence analysis’ and ‘cyberwar’. In such novel domains there are few meaningful baselines against which one might directly gauge the goodness of new work methods, whether goodness is thought of in terms of efficacy or efficiency.
2.3. The complexity problem

The statement of the fundamental disconnect above referred to ‘effective experimentation’. What counts as ‘effective’ experimentation needs to be defined within the paradigm of experimentation itself. Its logic requires series of studies, in which variables are isolated and manipulated or controlled, to achieve any certainty about the causal relations of task factors (e.g. display designs, software features, etc.) to performance. Multiple experiments are always required to pin down cause–effect relations when it comes to human–computer interaction. (For years, the policy of the prominent experimental psychology journals has been that reports have to consist of discussions of converging series of experiments; see Kiess and Bloomquist 1985.) Complexity has entailed issues for the social sciences: ‘The major difficulty for [generalisation from sample to population] is that it is difficult to sample all the things that must be sampled to make a generalization . . . the sheer number [of interacting factors] can lead to unwieldy research plans’ (Firestone 1993, pp. 18, 19; see also Chronbach 1975). Complexity has entailed issues for cognitive science, which George Miller summarised as ‘the dismembering of cognition’ (Miller 1986; see also Newell 1973, Jenkins 1974). Complexity also spells problems for the study of envisioned worlds of cognitive work. There are far too many variables that have to be taken into account and hence too many experimental tests that have to be conducted. Variables of importance include such things as features of the participants (experience, intelligence, motivation, aptitude, etc.), features of the test scenarios (interesting, rare, easy, boring, etc.), features of the teams (co-located, asynchronous, dysfunctional, etc.) and features of the tools (e.g. it is sometimes quite easy for human factors engineers to make better interfaces than ones produced by a designer-centred design approach). There may be numerous other important factors to add to such a list of proliferating variables.
2.4. The problem of requirements creep

Requirements for software are emergent and cannot be completely cast in stone as the first step of a stepwise requirements engineering process (Goguen 1997, Hoffman and Elm 2006, Neville et al. 2008). At a minimum, the process must involve iteration, and there must be substantive collaboration between software engineers and cognitive systems engineers, given their complementary roles (creating software vs. forming the empirical base on which to design the functionality of the software) (Hoffman 2008, Hoffman and Deal 2008), but even these do not hold complexity in check. To be effective, cognitive work systems must be adaptable and resilient (Hollnagel, Woods and Leveson 2006). They must
be able to change and sometimes be able to change the way they change. Historically, it has been felt in software engineering that ‘requirements creep’ is a bad thing, a main source of problems for the programmer. Creep is an empirical fact, however. The problem seems to be that system development is thought of as a controlled serial process, when, in fact, it consists of continuous, parallel, interacting processes (envisioning, design, evaluation, refinement, validation, etc.) (Hoffman and Elm 2006).
2.5. Implications

These problems mean, from an experimentalist point of view, that a procurement process should hinge on a rather lengthy series of studies (i.e. to go beyond mere satisficing). That would be time consuming and costly. Procurement would take even longer than it already does, just at a time when a main aim is to reduce overall procurement time (see Hewish 2003). By the time the relevant factors have been controlled, key variables isolated and effect sizes estimated, design requirements changed and re-evaluated, etc., the cognitive work will almost certainly have passed on to other incarnations. Therefore, it is necessary to escape the fundamental disconnect between the time frame for experimentation and the time frame for effective change in the world of unique events.
3. The designer’s gamble

In the standard view of hypothesis testing, real-world variability must be restricted, by being either controlled or manipulated. However, all of the variability of the world is in effect when the software systems are to be used in conducting the actual work. Intelligence, motivation, alertness, problem difficulty and many other factors might be moderator variables that influence the relation of learning and performance (Baron and Kenny 1986) and can be in play when any new technology is being used in a changed work system (e.g. a command post). So why should one assume that such variables have to be controlled when work methods are being evaluated? Indeed, one wants the daunting variability of the world to be preserved in the evaluation of new technologies. We express what we call ‘the designer’s gamble’:

We, the designers, believe that our new technology is good, and that good work will result from its use. Thus, we can let the daunting variability of the real world remain in the summary statistics and measurements, and we can conduct reasonably risky tests of usefulness and usability. We’re going to gamble that the software/tools and new work methods are so good that the cognitive work (human–system integration, etc.) will be good despite the daunting variability of the real world.
Table 1 expresses some of the extremes. The designer’s gamble is by no means a fantasy. Just as funding programme announcements sometimes seem to ask for the world, research proposals often promise it. Statements of the following general type often appear in grant proposals and pre-proposal white papers:

We will develop new modelling strategies leveraging previous research in dynamic networked environments. This architecture will provide near real-time interoperability and robustness and will allow the detection and modelling of information flows and actions and mitigate data overload. This will then be integrated with a suite of algorithms that will automatically reconfigure the running simulation . . .
Table 1. Some of the daunting variability of the world.

Best performing participant (or team):
    Capability: high intelligence; highly proficient (expert); high aptitude; high verbal ability.
    Readiness: high alertness; high preparedness; high capacity for recognition-primed decision making.
    Motivation: high intrinsic motivation; achiever attitude.

Worst performing participant (or team):
    Capability: inadequate training; low intelligence; low aptitude; low verbal ability.
    Readiness: low arousal; insufficient experience; has a headache.
    Motivation: low motivation; proceduralist attitude.
Such statements are promissory notes, as shown by the heavy reliance on the word ‘will’. Other words, such as ‘might’, would be more appropriate. Phrases such as ‘we hope will’ would be more honest. Organisations, teams and individuals who seek to create information technology invariably base their entire approach and design rationale on the designer’s gamble. The designer’s gamble is an assumption made during the processes of procurement. As such, it is a leverage point for empirical analysis and, in particular, for testing hypotheses about the goodness of software tools. What follows from the designer’s gamble is a way around the fundamental disconnect.
4. Range statistics

Range statistics might represent a fast-track solution that can address questions concerning the goodness of the cognitive work, on the assumption of the designer’s gamble. One can conduct studies of the same sort that are generally conducted now; one can still treat variance as a thing to be analysed in exploration of hypotheses about the cognitive work, using the familiar parametric statistical tests on data from experiments on human–computer interaction to probe the meaning of the variability (e.g. effects of co-located vs. distributed teams). In other words, people can go on doing pretty much what they do now when they study human–computer interaction and test new software. However, at the same time, range statistics can be pulled out and utilised as a fast-track probe of the goodness of the tools and adequacy of the work methods. This can be made concrete with an example – a hypothetical but realistic experiment.
4.1. An example

Suppose that a humanitarian emergency response team is to be provided with new software that they carry on their laptops, to support the management and coordination of activities and the utilisation of resources. The new software is to replace their ad hoc work method (e.g. email, spreadsheets, etc.). The new software, identified by the obligatory catchy acronym, might have been created on the basis of deep and meaningful input from people who have actually conducted this sort of work and other considerations that are important to making technology usable, useful and understandable. Alternatively, it might not. Inevitably, however, there is some form of usability evaluation.
One strategy is to conduct a study in which a group of experienced emergency responders is situated in a laboratory consisting of a number of separate workstations. Participants receive some initial practice trials in order to learn about the task (the displays, the commands and the work method shaped by the new technology). Using headphones, the group simulates a distributed team and conducts an emergency response management operation based on a simulated scenario derived from case studies. A comparison group does the same, but in this case all of the participants are situated in a single room and can communicate openly and directly with each other, simulating a co-located team. Through such an experiment, one could evaluate hypotheses about the effect of a variable of the cognitive work (distributed vs. co-located). Measures could be made of such things as time to first response or number of citizens evacuated per hour. One could probe the emergency responders’ moment-to-moment shared situational awareness, evaluate patterns in the team members’ communications with one another and take periodic ratings of mental workload (Endsley 1995, Megaw 2005). Now suppose that the new technology results in work that seems to be improved, relative to the current capabilities, but only for the co-located team. One might jump to the conclusion that co-location makes the difference. But the result might be because the team that was co-located simply happened to learn how to use the software better than the distributed team in the first place. To resolve this matter, one can imagine scores of controlled factorial experiments in the tradition of null hypothesis testing.
4.2. Where this example leads

In much of the research on human–computer interaction, indeed in much psychological research on learning and performance, data are not collected during the initial instruction and practice trials. Any data that might be collected would not be of any particular interest with regard to the major hypotheses being studied in the main trials (an experiment involving controlled and manipulated variables). After all, the participants are just learning the basics of the task. For example, in studies of performance at dual-attention tasks, participants will be given practice at each of the dual tasks (e.g. reading letter strings to identify strings containing the letter D, plus pushing a button whenever a coloured patch appears in one of the four corners of the display). In such research, the participant’s performance at the practice trials is not of interest. Performance at practice trials is a neglected resource and a number of measures taken during practice trials could be informative. Often, as in the hypothetical example, the practice phase of the experiment involves presenting the participants with some fixed number of practice trials. In many studies, practice can involve training to a criterion level of performance, however many trials that might take, in order that the participants might be equated for degree of original learning. This strategy serves to address the possible confound in the interpretation of main effects and interactions of the independent variables that are formative of the main test trials. However, this strategy also sets the stage for using range statistics in order to evaluate work methods. Most experimental studies in the evaluation of new software tools involve certain basic steps. These often include: (1) instruction and practice trials; (2) main test trials; (3) subsequent debriefing. We propose a variation on this basic design. For the purposes of explanation, a generic design is illustrated in Figure 1. It will be this general design that will be referenced in the subsequent discussion.
Figure 1. A general experiment design.
5. Principal performance measures and range statistics

There is a great variety of measurables, any of which may be important for a given project or cognitive work system. Hoffman et al. (2008) categorised these as:
- Gains (things that one would want to increase), such as intrinsic motivation, the achievement of proficiency and justified trust.
- Reductions (things that one would want to decrease), such as mental workload and negative hedonicity.
- Avoidances (things that one would want to mitigate), such as the need for kluges.
Sponsors who seek metrics are generally focused on principal performance measures that reflect efficiency (number of primary task goals achieved per unit time). The choice of a principal performance measure will depend on the domain, the specific tasks and other features of the cognitive work. Principal performance measures can also be non-traditional, that is, they can attempt to capture cognitive work at higher levels (e.g. recovery of situational awareness following a multi-tasking distraction). The principal performance measure can be one for which ‘more is better’, but it can also be one for which ‘more is worse’: an example from traditional human performance measurement would be number of errors, and an example at the system level would be ‘number of communications requesting clarification’ as a measure of loss of situation awareness. A conceptual generic definition of a principal performance measure is number of principal task goals successfully accomplished per unit time at work. For example, performance using a new interface for the control of unmanned aerial vehicles might involve a principal performance measure for a sensor payload operator of ‘number of targets photographed per mission’ (see Hancock et al. 2005). Performance of a team in a simulated scenario might be measured by ‘number of hostages rescued without being observed’. A principal performance measure is thus used to form the criterion for training. So for instance, rather than having participants run through some fixed number of practice trials
prior to the main test trials, each participant would be run through however many trials it takes to achieve some pre-specified criterion. Practice can involve training to a lax criterion or to a strict criterion (such as number of practice trials until performance achieves criterion at least two trials in a row). Any value can be used to set a criterion, and whether it is a liberal or a conservative one will depend upon the work system at hand and the nature of the individuals who are expected to operate it. For tasks having historical performance precedent, the criterion might be based on archived data, baselines of performance, legacy training standards or other prior empirical sources. For tasks having no legacy counterpart, and hence no previous baseline data, a criterion can be set based on informed judgement or (better still) some model of the work, a model that would come from envisioned world explorations and the application of methods of cognitive task analysis. The criterion would mature as evaluations are conducted.
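As a concrete illustration of how a trials-to-criterion practice phase might be instrumented, the following sketch (not taken from the authors' materials) assumes a hypothetical run_practice_trial function that runs one practice trial and returns the principal performance measure; the strict two-in-a-row criterion mentioned above is the default, and a lax criterion corresponds to consecutive=1.

    from typing import Callable

    def trials_to_criterion(run_practice_trial: Callable[[], float],
                            criterion: float,
                            consecutive: int = 2,
                            max_trials: int = 50) -> int:
        """Number of practice trials needed to perform at or above `criterion`
        on `consecutive` trials in a row."""
        streak = 0
        for trial in range(1, max_trials + 1):
            score = run_practice_trial()   # principal performance measure for one trial
            streak = streak + 1 if score >= criterion else 0
            if streak >= consecutive:
                return trial
        return max_trials                  # criterion not reached within the allotted trials

    # Hypothetical usage with a random stand-in for the real task software:
    import random
    print(trials_to_criterion(lambda: random.gauss(8.0, 2.0), criterion=10.0))

The trial count returned for each participant is the raw datum from which the B and W values discussed below are later taken.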
6. The mathematics of range statistics for trials-to-criterion

In his prospectus for usability analysis, Nielsen (1993, p. 26) suggested that one might examine the ‘novice user’s experience at the initial part of the learning curve’ (p. 28), expecting a steep incline and the achievement of a reasonable level of proficiency within a short time for highly learnable systems. Counts of trials to achieve (or re-achieve) criteria are highly unlikely to have a Gaussian distribution (Book 1908). Measures of trials-to-criterion are an instance of a process in which values are constrained by some sort of stopping rule. One would expect relatively few participants to achieve criterion on the first, second or even perhaps third trial; were that to happen, one would have to conclude that the cognitive work is trivial. However, one would expect to see many participants achieve criterion after more than a few trials. Such a distribution of ‘small numbers of small numbers’ typically would be highly skewed and have a ‘fat tail’. This characterises distributions such as the negative binomial, illustrated in Figure 2.
Figure 2. A stylised distribution for trials to achieve or re-achieve criterion.
For cases such as these, an order statistic (range or median) is preferred to an average, because one is dealing with distributions where one can be certain beforehand that the average value will be misleading and unrepresentative (i.e. there is a considerable difference between the mean and the median) (Newell and Hancock 1984). Furthermore, one is not really interested in averages but in entire distributions, and especially the extremes (see also Nielsen 1993). From probability density functions, such as that derivable from the negative binomial, one can conduct statistical evaluations such as the determination of confidence intervals, as well as estimate the likelihood that any given participant would achieve criterion on any given trial. Additional calculations are possible, such as the likelihood that any given trial on which criterion was achieved would be followed by a second trial on which performance was at or above criterion. Such quantitative outcomes could then be employed to evaluate the main experimental hypotheses about the effects of independent variables of the cognitive work. What we focus on here, however, is the use of range statistics to evaluate the work method. Range statistics allow the ‘daunting variability of the world’ to become more readily understandable through analysis.
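As an illustration of why the mean is a poor summary here, and of the kind of likelihood estimate just mentioned, the following sketch uses an invented generative model: each practice trial independently reaches criterion with probability p, and criterion is declared met after r at-criterion trials, so the count of failed trials is negative binomial and total trials equal failures plus r. The values of r and p, the sample size, and the model itself (which ignores learning across trials) are assumptions for illustration only.

    import numpy as np
    from scipy import stats

    r, p = 2, 0.40                        # hypothetical model parameters
    rng = np.random.default_rng(7)
    failures = stats.nbinom.rvs(r, p, size=30, random_state=rng)
    trials = failures + r                 # trial on which criterion was met, for 30 simulated participants

    print("mean  :", trials.mean())       # pulled upward by the fat tail
    print("median:", np.median(trials))   # a more representative order statistic

    # Likelihood that a given participant achieves criterion on, or by, trial k.
    k = 5
    print("P(criterion on trial %d) = %.3f" % (k, stats.nbinom.pmf(k - r, r, p)))
    print("P(criterion by trial %d) = %.3f" % (k, stats.nbinom.cdf(k - r, r, p)))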
7. Learnability

The ‘readability’ of software support documentation (as in Rockwell and Bajaj 2005) is useful in the assessment of learnability, but the approach presented here gets at learnability measurement directly. Letting B stand for trials-to-criterion (or trials to re-achieve criterion) for the best performing participant and W stand for trials-to-criterion (or trials to re-achieve criterion) for the worst performing participant, the mid-range is (B + W)/2. In this case of only two numbers, the mid-range is a form of average, but one is not interested in the mid-range because it may carry with it some of the properties of the mathematical average, as interesting and useful as those properties might be. Rather, one is interested in it because it preserves variability. The particular function of the mid-range that is of interest is its inverse, a new statistic denoted with the symbol ī (pronounced ‘i-bar’), which equals 2/(B + W). This function of B and W was chosen because it is a simple transformation of the mid-range and scales the numbers such that they fall between zero and one. The ī numerical scale can be interpreted as a conjoint measure, that is, a measure of more than one thing. If ī is quite close to 1.00, the best and worst performing participants ‘got it’ within just a few trials: either the cognitive work is trivial or the criterion has been set too low. As ī gets closer to zero, one would conclude that the cognitive work is extremely difficult or that the criterion was set too high. Thus, the ī scale can serve as a tool for fine tuning the criterion or guiding the selection of learning-trial cases (or problem tasks) of an appropriate degree of difficulty. Once one has reason to believe that the criterion is appropriate, ī can be interpreted as a scale of the ‘learnability’ of a work method. In this situation, if ī is high, the designer won the gamble; if ī is low, the designer lost. Details of this interpretation of ī are presented in Table 2. This is just one interpretation and is based primarily on experience in the experimental psychology laboratory. In practice trials, if the best performer gets it right on the first trial and the worst performer gets it on, say, the third trial, then chances are the task is quite easy. If the worst performer does not ‘get it’ after nine or more trials, then chances are the task is fairly difficult.
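As a minimal sketch of the computation just described (not the authors' code), with invented trials-to-criterion counts:

    def i_bar(trials_to_criterion):
        """Inverse of the mid-range of trials-to-criterion counts: 2 / (B + W)."""
        B, W = min(trials_to_criterion), max(trials_to_criterion)
        return 2.0 / (B + W)

    practice = [2, 4, 3, 6, 5, 3, 4]    # hypothetical trials-to-criterion data
    print(i_bar(practice))              # 2 / (2 + 6) = 0.25

The resulting value of 0.25 would fall well below the trivial range in Table 2.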
Table 2. An interpretation of the ī scale. Example values are (B, W) pairs with the corresponding ī = 2/(B + W).

Range of trivial cognitive work: (1, 1) 1.00; (1, 2) 0.66.
    Desired discrimination: an ī between 1.00 and 0.66 suggests that the cognitive work may be trivial or that the performance criterion needs to be raised.

Range of non-trivial cognitive work: (1, 3) 0.50.
    Edge of the range; the criterion may still be set too low.

Range of (re)learnability: (1, 4) 0.40; (2, 3) 0.40; (2, 4) 0.33; (1, 6) 0.29; (2, 5) 0.29; (3, 5) 0.25; (3, 6) 0.22; (4, 6) 0.20.
    Fine discriminability is desired.

Range of stretch: (2, 9) 0.18; (3, 8) 0.18; (5, 7) 0.16; (4, 9) 0.15.
    Finest discriminability is desired.
In the range of stretch, the cognitive work might be extremely difficult, the work method might be very low in learnability, the criterion might have been set too high or some combination of these may be the case. In the case of extremely difficult cognitive work, differences in ī at the second decimal place might be meaningful. An example might be helicopter training, where trainees receive hours of practice at the task of hovering a helicopter, taking an average of about 20 hours to receive approval to attempt solo flight (Still and Temme 2006). Often when people learn to fly an aircraft simulator, only after the first dozen or so practice trials, which usually result in crashes, does one begin to see trials where speed, altitude, heading, etc. are successfully maintained, even for ‘easy’ flights (McClernon 2009). Thus, ī can be interpreted as a conjoint measure: reasonableness of the criterion; learnability; re-learnability; stretch. To augment the use of the statistic ī, it is possible to find its probability distribution. This distribution depends on the choice of a probability density function to model the data (trials to achieve or re-achieve criterion). Once that is determined, the joint cumulative probability distribution function for B and W can be derived. From that, the convolution yielding the density of B + W follows. Finally, the one-to-one transformation from the density of B + W to that of ī = 2/(B + W) is straightforward. With such exact modelling, one can ask, for instance: what is the probability of obtaining an ī of some value for trials to re-achieve criterion, on the assumption that the ī comes from the distribution that was found to describe trials to criterion for initial learning? If that probability is low, one can conclude that the participants did not retain the original learning and the designer lost the gamble. If that probability is high, one can conclude that the participants not only ‘got it’ but also retained it, and that the designer won the gamble. It is possible (though remotely so) that trials-to-criterion datasets might consist of values such as [1, 1, 1, 2, 3, 4, 8] vs. [1, 2, 3, 4, 8, 8, 8]; these two sets would have the same ī values.
With regard to such possibilities, it is expected that researchers will do with ī just what they do currently in statistical analysis: they will begin by looking at graphs. A frequency histogram of trials-to-criterion data would reveal the nature of the departure from the Gaussian form and, in the case of the ‘1s and 8s problem’, guide the interpretation of ī.
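One hedged sketch of what such exact modelling could look like is given below. It reuses the invented negative binomial model from the earlier sketch and, rather than carrying out an explicit convolution, directly enumerates the joint probability of B (the minimum) and W (the maximum) over n participants using the standard order-statistic identity for discrete i.i.d. samples; all parameter values are hypothetical.

    # P(B = b, W = w) = [F(w)-F(b-1)]^n - [F(w-1)-F(b-1)]^n
    #                   - [F(w)-F(b)]^n + [F(w-1)-F(b)]^n    for b < w,
    # with the b == w diagonal handled by clamping negative terms to zero.

    from collections import defaultdict
    from scipy import stats

    r, p, n_participants = 2, 0.40, 12     # hypothetical model and sample size
    K = 80                                 # truncation point for the summation

    def F(t):
        """P(trials to criterion <= t) under the illustrative model."""
        return stats.nbinom.cdf(t - r, r, p) if t >= r else 0.0

    def clip_pow(x, n):
        return max(x, 0.0) ** n

    ibar_pmf = defaultdict(float)
    for b in range(r, K + 1):
        for w in range(b, K + 1):
            prob = (clip_pow(F(w) - F(b - 1), n_participants)
                    - clip_pow(F(w - 1) - F(b - 1), n_participants)
                    - clip_pow(F(w) - F(b), n_participants)
                    + clip_pow(F(w - 1) - F(b), n_participants))
            ibar_pmf[round(2.0 / (b + w), 4)] += prob

    # e.g. how probable an observed i-bar of 0.25 or better would be under this model
    print(sum(q for v, q in ibar_pmf.items() if v >= 0.25))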
8. Resilience measures and team measures

Resilience is the ability to recognise and adapt to unanticipated perturbations that might stretch the workers’ competence and demand a shift of processes and strategies (Hollnagel et al. 2006). For the study of resilience, one can adapt a method used in clinical trials. Once it is seen in clinical trials data that some treatment is effective, one can cease the experimental trials and start giving the treatment even to those who had previously been receiving either the control or the placebo treatment. In the present case, once sufficient data had accrued to warrant conclusions about the learnability of a work method, there could be a session of resilience trials (rather than re-learning trials; see Figure 1), in which ‘the system’ is stretched. This can be achieved in a variety of ways. Voice loops to one team member might be delayed (simulating communication delays due to bandwidth issues). One might simulate a communication loss or a loss of team functionality. Performance could be evaluated by trials to re-achieve criterion. One assumes some reasonable level of learnability but now interprets the range as a reflection of the resilience of the work system and, likewise, interprets the ī numerical scale as a measure of resilience. A team is obviously more than a simple aggregation of individuals. Nevertheless, the measure that has been described can be applied in the analysis of work methods and technologies for teams and team cognitive work. For instance, rather than evaluating trials to achieve criterion on the part of the best (and worst) performing participants, one can evaluate ī for the best (and worst) performing team. Measures can be of learnability, re-learnability and resilience with regard to the team cognitive work. As before, the range statistics (the B and W numbers for trials to criterion) are pulled out from the training and retraining trials in order to evaluate hypotheses concerning the learnability or the goodness of the work method, while all of the main experiment data can be used in the evaluation of hypotheses concerning the cognitive work (e.g. the effect of some variable on team sense-making).
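A short, self-contained sketch of this reuse of the statistic, with invented counts, for an individual-level (or equally a team-level) comparison of learnability and resilience:

    def i_bar(counts):
        """Inverse of the mid-range, 2 / (B + W)."""
        return 2.0 / (min(counts) + max(counts))

    trials_to_learn     = [2, 4, 3, 6, 5, 3, 4]  # initial learning phase
    trials_to_reachieve = [1, 3, 2, 5, 4, 2, 3]  # resilience trials, after a simulated perturbation

    print("learnability i-bar:", i_bar(trials_to_learn))       # 2 / (2 + 6) = 0.25
    print("resilience   i-bar:", i_bar(trials_to_reachieve))   # 2 / (1 + 5) = 0.33, read as resilience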
9. Conclusion and prospects

This paper has presented an argument about the potential value of range statistics in the evaluation of cognitive work and has outlined a conjoint measurement scale. A next step is to model actual datasets so as to answer questions about the likelihood of experimental outcomes. We are currently exploring exact solutions for modelling order statistics using the negative binomial, the logistic function and cumulative probability distributions. Those who are interested in metrics for genuine evaluation of cognitive work are invited to consider the general proposal that:
- Variability is something to explore, not average away.
- The designer’s gamble might be a leverage point for the evaluation of software-supported cognitive work.
- It is possible to measure the learnability of software-supported cognitive work and perhaps even do so very easily.
This proposal is, admittedly, rather bold: that by adopting a simple aspect of experimental design (trials to criterion during the initial practice or training phase) one can proceed as usual with studies of human–computer interaction and, at the same time, spin off two numbers, just two, that might provide a means of evaluating the learnability of the work method, that is, the goodness of the software tools. Some may feel uncomfortable with the attempt to merge identity and difference into a single test statistic: the ī scale, interpreted as a measure of learnability (or re-learnability). The present authors believe that this novelty is one of the things that makes the ī concept interesting. The strategy is:
(1) to provide the kind of quantitative measure (or ‘metric’) that is frequently desired by sponsoring agencies, that is, a performance efficiency measure;
(2) to apply it in such a way as to support traditional hypothesis testing;
(3) to serve, at the same time, as a fast track for the evaluation of the goodness of the technology.

Another aspect of this strategy is to make it easy to bring in systems considerations that go beyond traditional approaches to performance measurement. The present authors are confident that range statistics are appropriate for the analysis of the learnability of cognitive work methods. Letting the ‘daunting variability of the world’ be the focus of the evaluation will suggest new approaches to the measurement of cognitive work, even though this general notion is apparently at odds with (but complementary to) traditional approaches, which assume that the experimenter must control sources of variability and which rely on parametric statistical analysis assuming Gaussian distributions. It is also hoped that the extension of the basic measurement idea to resilience will serve to bring genuine systems-level thinking into the analysis of performance. As society’s technological sophistication and complexity grow, to create a science of complex cognitive systems one must possess valid, reliable and meaningful measures of system-level performance, including measures pertinent to human-centring. Significant empirical and theoretical issues remain to be resolved. What will be the procedure for using this approach? What are the properties of ī related to statistical inference? The discrimination of bogus from true outliers is an important and interesting problem, and not one likely to be solved solely by mathematical analysis. It is one of the things we are interested in pursuing as we explore datasets. This discussion has been premised on the analysis of data based on some single measure on which trials-to-criterion is scaled. Though a single measure might be compound or conjoint, what if one sees value in using multiple measures and employing multiple criteria? Might there be some comparison of distributions allowing evaluation of the relative informativeness of range statistics with regard to learnability? How would that take into account the fact that a measure might be a straightforward measure of performance (e.g. hits per unit time on task) or a higher-level measure (e.g. recovery from an incorrect interpretation of situational cues)? Exploration of these and other questions will likely require an effort of considerable scale. The model underlying trials-to-criterion data must be researched. For that reason, researchers are invited to consider adopting the trials-to-criterion control methodology and to share their data, to support a process of generating a suite of models and the associated joint cumulative distribution functions. What is offered here is not a closed-end solution. Rather, it is a first step, a prospectus for a challenging journey to re-think, document and quantify the character and
capacities of large-scale interacting human–machine systems. The present authors seek to enable researchers to model ‘small numbers of small numbers’ in a way that permits the usual course of controlled factorial experimentation to test hypotheses about cognitive work and at the same time provides that fast track into the analysis of the goodness of new work methods and the associated software tools for conducting cognitive work in complex sociotechnical work systems.
Acknowledgements
The authors would like to thank the two anonymous reviewers for their supportive comments and their suggestions for improvement. The contribution of the first and third authors was through participation in the Advanced Decision Architectures Collaborative Technology Alliance, sponsored by the US Army Research Laboratory under Cooperative Agreement DAAD19–01–2-0009.
References Ballas, J.A., 2007. Human centered computing for tactical weather forecasting: An example of the ‘Moving Target Rule’. In: R.R. Hoffman, ed. Expertise out of context: Proceedings of the sixth international conference on naturalistic decision making. Boca Raton, FL: CRC Press, 317–326. Baron, R.M. and Kenny, D.A., 1986. The moderator-mediator variable distinction in social psychological research: Conceptual, strategic and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182. Book, W.F., 1908. The psychology of skill with special reference to its acquisition in typewriting. Studies in psychology. Vol. 1, Missoula, MT: University of Montana Press. Chronbach, L., 1975. Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127. Endsley, M.R., 1995. Measurement of situation awareness in dynamic systems. Human Factors, 37, 65–84. Firestone, W.A., 1993. Alternative arguments for generalizing from data as applied to qualitative research. Educational Researcher, 22, 16–23. Goguen, J., 1997. Towards a social, ethical theory of information. In: G. Bowker, L. Gasser, L. Star and W. Turner, eds. Social science research, technical systems, and cooperative work. Hillsdale, NJ: Lawrence Erlbaum Associates, 27–56. Grudin, J., 1992. Utility and usability: Research issues and development concepts. Interacting with Computers, 4, 209–217. Hancock, P.A., Pepe, A.A., and Murphy, L.L., 2005. Hedonomics: The power of positive and pleasurable ergonomics. Ergonomics in Design, Winter, 8–14. Hewish, M., 2003. Out of CAOCs comes order. Jane’s International Defense Review, May, 22. Hoffman, R.R., 2008. Influencing versus informing design 2: Macrocognitive modeling. IEEE: Intelligent Systems, November/December, 86–89. Hoffman, R.R. and Deal, S.V., 2008. Influencing versus informing design, Part 1: A gap analysis. IEEE: Intelligent Systems, September/October, 72–75. Hoffman, R.R. and Elm, W.C., 2006. HCC implications for the procurement process. IEEE: Intelligent Systems, January/February, 74–81. Hoffman, R.R., Marx, M., and Hancock, P.A., 2008. Metrics, metrics, metrics: Negative hedonicity. IEEE: Intelligent Systems, March–April, 69–73. Hoffman, R.R. and Woods, D.D., 2005. Steps toward a theory of complex and cognitive systems. IEEE: Intelligent Systems, January/February, 76–79. Hollnagel, E., Woods, D.D., and Leveson, N. eds., 2006. Resilience engineering: concepts and precepts. Aldershot, UK: Ashgate.
Jenkins, J.J., 1974. Remember that old theory of memory? Well, forget it! American Psychologist, 29, 785–795. Kiess, H.O. and Bloomquist, D.W., 1985. Psychological research methods: A conceptual approach. Boston: Allyn & Bacon. Koopman, P. and Hoffman, R.R., 2003. Work-arounds, make-work, and kludges. IEEE: Intelligent Systems, November/December, 70–75. Laplante, P., Hoffman, R. and Klein, G., 2007. Antipatterns in the creation of intelligent systems. IEEE Intelligent Systems, January/February, 91–95. McClernon, C.K., 2009. Stress effects on transfer from virtual environment flight training to stressful flight environments. Dissertation (Doctoral). The Modeling, Virtual Environment, and Simulation Institute, Naval Postgraduate School, Monterey, CA. Megaw, T., 2005. Definition and measurement of mental workload. In: J.R. Wilson and E.N. Corlett, eds. Evaluation of human work. Boca Raton, FL: Taylor & Francis, 525–551. Miller, G.A., 1986. Dismembering cognition. In: S.H. Hulse and B.F. Green Jr, eds. One hundred years of psychological research in America. Baltimore, MD: Johns Hopkins University Press, 277–298. Neville, K., et al., 2008. The procurement woes revisited. IEEE Intelligent Systems, January/ February, 72–75. Newell, A., 1973. You can’t play 20 questions with nature and win: Projective comments on the papers of this symposium. In: W.G. Chase, ed. Visual information processing. New York: Academic Press, 283–309. Newell, K.M. and Hancock, P.A., 1984. Forgotten moments: Skewness and kurtosis are influential factors in inferences extrapolated from response distributions. Journal of Motor Behavior, 16, 320–335. Nielsen, J., 1993. Usability engineering. San Diego, CA: Academic Press. O’Neill, M.J., 2007. Measuring workplace performance. 2nd ed. New York: Taylor and Francis. Rockwell, S. and Bajaj, A., 2005. COGEVAL: Applying cognitive theories to evaluate conceptual models. In: K. Siau, ed. Advanced topics in database research. Vol. 4, Hershey, PA: Idea Group Publishing, 255–282. Rosson, M.B. and Carroll, J.M., 2002. Usability engineering. San Francisco, CA: Morgan Kaufmann. Schaffer, E., 2004. Institutionalization of usability. Boston: Addison-Wesley. Shepherd, A., 2001. Hierarchical task analysis. London: Taylor and Francis. Shneiderman, B., 1998. Designing the user interface. 3rd ed. Upper Saddle River, NJ: Addison Wesley. Still, D.L. and Temme, L.A., 2006. Configuring desktop helicopter simulation for research. Aviation Space and Environmental Medicine, 77, 323. Vicente, K., 1999. Cognitive work analysis. Mahwah, NJ: Lawrence Erlbaum Associates. Woods, D.D., 1998. Designs are hypotheses about how artifacts shape cognition and collaboration. Ergonomics, 41, 168–173.
About the authors Robert Hoffman is Senior Research Scientist at the Institute for Human and Machine Cognition in Pensacola, Florida. Among his many interests are the psycholinguistics of figurative language, the psychology and history of science, the cognition of experts (especially weather forecasters), humancentred computing, and geological field research in planetary science. His major current projects are the development of a general theory of macrocognitive work systems, and theories of complexity and measurement in systems analysis. Morris Marx is currently Trustees Professor and President Emeritus of the University of West Florida. He received his BS, MS and PhD from Tulane University. His research interests are
mathematical statistics, graph theory, and topology. He is the author with Richard Larsen of An introduction to mathematical statistics and its applications (4th ed.) (2005). Raid Amin is currently a Professor in the Department of Mathematics and Statistics at the University of West Florida. He is Director of the University of West Florida Consulting Services. He received his BA in Statistics from Baghdad University and his Master’s and PhD degrees from Virginia Tech. His research interests include statistical process control, statistical computing, assessment of learning in mathematical sciences, and epidemiology and cancer clustering. His research appears in Journal of Quality Technology, Technometrics, Sequential Analysis, and other publications.
Patricia McDermott is a Lead Human Factors Engineer at Alion Science and Technology. She is Program Manager of the Army Research Laboratory’s Advanced Decision Architectures Collaborative Technology Alliance, the Army’s largest basic research program in cognitive and computer science. Ms McDermott received a BS in Psychology and MS in Human Factors Engineering from Wright State University in Dayton, OH. Her research interests include naturalistic decision making, human–robot interaction, and training.