Knowledge-Acquisition Interfaces for Domain Experts: An Empirical Evaluation of Protégé-2000

Natalya Fridman Noy, William Grosso, Mark A. Musen
Stanford Medical Informatics, Stanford University, Stanford, CA
{noy, grosso, musen}@smi.stanford.edu

Abstract. Application experts need to be able to maintain, populate, and verify existing knowledge bases in order to use knowledge-based tools in their daily tasks. Protégé-2000 is a tool that enables experts to perform knowledge acquisition and maintenance. In this paper, we describe a controlled study that we performed to measure both the efficiency and the quality of knowledge acquisition by domain experts and to evaluate how well domain experts retain their proficiency in using a knowledge-acquisition tool. We adapted the techniques that are used to evaluate the usability of traditional software products to evaluating the knowledge-acquisition process. Using Protégé-2000 and its domain-specific extensions, domain experts with no previous experience with knowledge-based systems were able to perform complex knowledge-acquisition tasks efficiently and correctly. Our data show that subject-matter experts were able to use both tools after only 1-2 hours of training and entered knowledge correctly 94% of the time overall. The custom-developed user interface greatly enhanced the experts' ability to identify known errors in a knowledge base (93% vs. 48% detection). The experts were also able to retain their skills over a 1-2 week period without using the system.
“Studying AI systems is not very different from studying moderately intelligent animals such as rats” — Paul R. Cohen (Cohen 1995)
1. Using Protégé-2000 to maintain knowledge bases of constraints

The use of knowledge-acquisition systems has usually been the realm of trained knowledge engineers, or at least of users with substantial computer experience. However, the ontologies and knowledge bases that drive modern intelligent systems need to be maintained. Menzies (1999) argues that "we must move the focus of knowledge engineering from knowledge acquisition to knowledge maintenance." Although the participation of system analysts at the knowledge-base development stage is feasible (if not always desirable), domain experts themselves must perform knowledge-base maintenance, quality control, and updates. Professionals want to be responsible for the knowledge-based systems used in their application area. However, most of the tools for knowledge-base development and maintenance that exist today are geared not just toward the extremely computer-literate user, but toward the computer-science-educated user (Duineveld et al. 1999). To allow domain experts to use knowledge-base construction and maintenance tools (hereafter referred to as knowledge-base editors), the developers of these tools need to make them easier to use. The impediments to developing user-friendly knowledge-base editors are enormous: not only are domain experts unaccustomed to thinking in terms of ontological categories, but also, in most cases, they will be only infrequent users of knowledge-base editors. Therefore, we cannot rely on frequent and prolonged use of the tools to reinforce user skills.
In this paper, we concentrate on the methodology and design of usability experiments to test the value of general and domain-specific knowledge-acquisition tools. We adapted usability-testing approaches used for traditional software to test the quality of knowledge-based systems. We describe a knowledge-acquisition tool for experts in a military domain that we implemented by extending Protégé-2000—an ontology-design and knowledge-acquisition tool developed in our laboratory (Grosso et al. 1999). We present the results of the experiments and the lessons that we learned from observing domain experts as they tried to maintain and edit a knowledge base. In this experiment, we tested the following two hypotheses: (1) Protégé-2000 enables domain experts to repair, improve, and extend knowledge bases; and (2) domain-specific extensions ("plugins") offer a cost-effective and intuitive way to improve the rate of knowledge acquisition and the quality of the acquired knowledge. We performed the experiments described in this paper as part of the knowledge-acquisition critical-component experiment in DARPA's High-Performance Knowledge Bases (HPKB) project (Cohen et al. 1999). In Section 2, we describe Protégé-2000 and its domain-specific extension for representing constraints on military units. We describe the experimental procedure, including the tasks and materials, in Section 3. The results of the experiment are given in Section 4. Section 5 contains our analysis of the results and discussion. Section 6 concludes the paper.
2. Protégé-2000 and the domain-specific plugins

2.1. Protégé-2000—A generic knowledge-acquisition tool

Protégé-2000 is an ontology-development and knowledge-acquisition tool that is designed to make it easier for domain experts to maintain and edit knowledge bases (Grosso et al. 1999). The system represents the latest installment in a series of such tools developed in our laboratory (Musen et al. 1993; Gennari et al. 1998). In Protégé-2000, a class hierarchy is represented visually as a tree, class definitions are represented as forms, and users acquire instances by filling out the forms. Protégé-2000 generates default instance forms based on the types and cardinality of the slots. Figure 1 shows the screen for navigating and acquiring instances in Protégé-2000. The left pane displays the class
hierarchy, the middle pane shows the list of instances for the selected class, and the right pane contains the form for the selected instance.

Figure 1. Representing the ontology of unit constraints in Protégé-2000

Protégé-2000 is an extensible framework in which users can develop their own task- and domain-specific plugins that access a Protégé knowledge base and use Protégé graphical user-interface elements. We designed the HPKB tab—an extension to Protégé-2000 that helps domain experts manage large knowledge bases containing information about constraints on military units (mainly their organizational structure).
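To make the plugin mechanism concrete, here is a minimal sketch of a tab plugin in Java, modeled on the tab-widget API of the open-source Protégé releases; the class name, tab label, and UI contents are illustrative assumptions rather than the actual HPKB tab code.

```java
import java.awt.BorderLayout;
import javax.swing.JLabel;
import edu.stanford.smi.protege.model.KnowledgeBase;
import edu.stanford.smi.protege.widget.AbstractTabWidget;

// Sketch of a domain-specific tab plugin; names are illustrative, not the
// actual HPKB tab code.
public class UnitConstraintsTab extends AbstractTabWidget {

    // Called once when the tab is created: set the tab label, obtain the
    // knowledge base, and build the custom view.
    public void initialize() {
        setLabel("Unit Constraints");           // title shown on the tab
        KnowledgeBase kb = getKnowledgeBase();  // same KB the generic forms edit
        setLayout(new BorderLayout());
        add(new JLabel("Frames in knowledge base: " + kb.getFrameCount()),
            BorderLayout.NORTH);
        // A real tab, such as the HPKB tab, would build its matrix of
        // unit types versus echelons here, querying kb for instances.
    }
}
```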
2.2. HPKB Tab—Providing a Summary View of a Large Number of Instances

For the HPKB project, we needed to represent a large knowledge base (about 1500 frames) of constraints on military units. The constraints contain information about the typical organization of opposing-force units. For example, if a commander knows that he is facing a tank army, he needs to know what subunits that army may contain. The US Army publishes a manual (ST 100-7) containing information on the typical organization of opposing forces. However, the information in that manual is often outdated or inaccurate; the world changes, and the organizational structure often depends on the nationality of the enemy. For example, Iraqi armies are organized differently than Soviet armies. Joint expeditions between the two forces may reflect yet another structure. Therefore, military experts need to maintain and update the information encoded in the knowledge base (and possibly used in knowledge-based systems analyzing battle plans) on a regular basis. We designed an extension to Protégé-2000 to represent the unit organization structure—the HPKB tab. The extension allows users to examine visually the information in the
large knowledge base, edit the data, determine which information is incomplete or missing, and where more work is required. Figure 2 shows a screenshot of the HPKB tab. We represent a large collection of constraint instances as a sparse matrix. Columns are different echelons (army, division, brigade, etc.) and rows are different types of units (infantry, tank, etc.) organized in a hierarchy. Information in a single cell of this table represents constraints on a particular type and echelon of a unit. For example, in Figure 2, we see that a mechanized-infantry brigade contains exactly one headquarters and exactly three mechanized-infantry battalions.
Figure 2. The unit constraints ontology in HPKB tab
We use color in the matrix to visualize several facts inferred from the ontology (a sketch of how such a cell status might be derived follows the list):

• Can an instance of a particular type and echelon (that is, a cell in the HPKB tab) exist at all? For example, information in the ontology disallows the existence of a Medical army.

• Have any instances for a particular unit type and echelon been created (and how many)? For example, in Figure 2 a mechanized-infantry brigade can have several possible compositions—several instances corresponding to one cell in the table. The second number in the cell indicates how many instances exist.

• How many instances has the user verified as being correct and complete? Editing a knowledge base can take a substantial amount of time, and the user will almost certainly be interrupted in the middle of doing so. It is useful for a user to know whether he completed filling in an instance or stopped in the middle.
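As a rough illustration of the logic behind the coloring (not the actual HPKB tab implementation), the following Java sketch derives a status for one cell from three hypothetical inputs: whether the ontology allows the cell, how many instances exist, and how many have been verified.

```java
// Sketch: deriving the color-coded status of one matrix cell (unit type x
// echelon). The inputs stand in for queries against the ontology and are
// hypothetical, not the actual HPKB tab code.
public class CellStatusExample {

    enum CellStatus { DISALLOWED, EMPTY, PARTIALLY_VERIFIED, VERIFIED }

    static CellStatus cellStatus(boolean allowedByOntology, // can the cell exist at all?
                                 int instanceCount,         // instances created so far
                                 int verifiedCount) {       // instances checked as complete
        if (!allowedByOntology) return CellStatus.DISALLOWED;  // e.g., a Medical army
        if (instanceCount == 0) return CellStatus.EMPTY;       // nothing entered yet
        if (verifiedCount < instanceCount)
            return CellStatus.PARTIALLY_VERIFIED;              // work remains
        return CellStatus.VERIFIED;                            // all instances verified
    }

    public static void main(String[] args) {
        // A mechanized-infantry brigade cell with 3 instances, 2 verified:
        System.out.println(cellStatus(true, 3, 2)); // PARTIALLY_VERIFIED
    }
}
```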
Other features of the HPKB tab that distinguish it from Protégé-2000 itself include the following (a sketch of the cardinality check behind the warnings appears after the list):

• natural-language summaries for chart elements in domain-specific terms

• natural-language warnings if the information that the user provided is incorrect or incomplete (for example, if the minimum allowed number of units is greater than the maximum number)

• automatic fill-in of slot values based on the position in the table

• multi-slot browser keys: a combination of several slots, in a way that is meaningful for the domain, that uniquely identifies an instance in a list of instances
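The following Java sketch illustrates the kind of cardinality check that could produce such a warning; the method and parameter names are hypothetical, not the HPKB tab's actual slot names.

```java
// Sketch of the consistency check behind a natural-language warning: if the
// minimum allowed number of subunits exceeds the maximum, the constraint is
// unsatisfiable. Names are illustrative.
public class CardinalityCheck {

    /** Returns a warning phrased in domain terms, or null if consistent. */
    static String checkSubunitBounds(String unitType, int minAllowed, int maxAllowed) {
        if (minAllowed > maxAllowed) {
            return "Inconsistent constraint for " + unitType + ": the minimum number "
                 + "of subunits (" + minAllowed + ") exceeds the maximum ("
                 + maxAllowed + ").";
        }
        return null; // consistent; no warning needed
    }

    public static void main(String[] args) {
        System.out.println(checkSubunitBounds("mechanized-infantry battalion", 4, 3));
    }
}
```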
3. Experiment Design

We evaluated Protégé-2000 and the HPKB tab during a knowledge-acquisition experiment at the Battle Command Battle Lab at Ft. Leavenworth, Kansas. We performed an ablation experiment to determine the value of the enhanced user interface of the HPKB tab: one group of two subjects used Protégé-2000, and the second group of two subjects used Protégé-2000 augmented with the HPKB tab to perform the same sets of tasks. The subjects were majors in the US Army with no previous experience in using knowledge-based systems. Their computer experience was limited to the use of traditional office software. To minimize the influence of individual-subject differences on the results, we performed a block-randomized study in which two subjects spent one day working with native Protégé-2000 and then spent the second day working with Protégé-2000 augmented with the HPKB tab. The two other subjects in the study worked with the two systems in the reverse order (Figure 3). On each of the two days, the subjects received up to 90 minutes of structured instruction about the particular knowledge-acquisition system that they would be using that day.
Figure 3. Experiment setup. Group 1 used Protégé-2000 on day 1 and the HPKB tab on day 2; Group 2 used the HPKB tab on day 1 and Protégé-2000 on day 2. In the skill-retention test 1-2 weeks later (day 3), Group 1 used the HPKB tab and Group 2 used Protégé-2000.
We then asked the subjects to perform seven clearly defined, standardized tasks that required them to inspect a knowledge base, to add new concepts, to identify errors, and to correct those errors. The subjects performed these tasks without any supervision or help; research staff were available only to clarify task statements. An important aspect of our experiment involved testing the subjects' retention of skills. Therefore, we prepared a third sequence of tasks and asked the subjects to complete these tasks one to two weeks after the initial experiment, working alone in their offices (see Figure 3). There was no additional training preceding the skill-retention experiment, and the subjects did not use the tool at all between the main experiment and the skill-retention one. Given that maintaining knowledge bases is not the primary everyday activity of the domain experts, and that direct use of the knowledge bases will occur only sporadically, we believe that determining whether subjects can still interact with a computer system long after the initial training session is an essential component of usability testing.
3.1. Tasks

We designed three sets of tasks for the three days of the experiment. There were two main design criteria: (1) in order for the results to be comparable, the three sets of tasks should be very similar to one another in level of difficulty and in the amount of work required to complete all the tasks in the set; and (2) within the set of tasks for a given day, the first tasks should be easy and subsequent tasks should get progressively harder. The second criterion was adopted to reduce the "frustration level"; if a subject gets
the most difficult task first and fails, his motivation to perform any of the remaining tasks may be very low (Nielsen 1994). In addition, "clean" knowledge bases were provided for each task to avoid compounding errors. Tasks were described in domain-specific terms, corresponding to the way subjects would think about such tasks in the course of their daily activities. That is, the task descriptions referred to military units, weapons, and other domain terms; they did not mention knowledge-representation terms, such as classes and instances, or user-interface terms, such as tabs and buttons. Figure 4 shows an example of a task statement.

Figure 4. Typical task formulation: "Verify that all Artillery subunits of Mechanized Infantry Brigade (IFV)(DIV) have their organization chart specified. You need to verify that each artillery unit mentioned in the chart for Mechanized Infantry Brigade (IFV)(DIV) has its own chart defined. All subunits of other types are now fully specified and you do not need to verify this fact. Only study the artillery subunits. For each artillery unit that does not have the chart defined, or does not have it checked (that is, it may be not fully specified), create or complete the chart."
3.2. Evaluation criteria

We used the following evaluation criteria to measure the subjects' performance:

• Knowledge-acquisition rate: We measured the number of knowledge-base changes per unit of time that the subjects completed.

• Ability to find errors: For several tasks, we introduced intentional errors into the knowledge base and asked the subjects to find and correct them. We measured the number of errors that the subjects were able to find relative to the number of errors that we introduced.

• Quality of knowledge entry: We measured the quality of knowledge entry in two ways: (1) quality of the final result—how many errors appeared in the resulting knowledge base; and (2) errors during task execution—how many of the knowledge-base changes that the subjects made were correct and how many were not.

• Error-recovery rate: We measured how many of their own errors the subjects noticed before they completed the task and how long it took them to correct the errors that they noticed.

• Retention of skills: For each of our evaluation criteria, we measured how well the subjects retained their skills after not using the Protégé tool for 1-2 weeks (and without additional training sessions).

• Subjective opinion: We gave the subjects usability questionnaires at the end of each day, asking for their opinions on that day's tools and, on the second day, asking them to compare the two tools that they had used (Rubin 1994).
3.3. Data collection

We collected the knowledge-acquisition rate (KA rate) data from the Protégé logs for the experiment. The absolute value of the KA rate is not especially meaningful because good baselines do not yet exist. However, the relative KA rate at various stages of the experiment, with different tools, and for different subjects is a good indicator of whether the subjects were getting more familiar with the system, forgetting how to use it, or working much faster with one tool than with the other. When measuring the KA rate, we counted only the
number of knowledge-base changes (KB changes) per unit of time (and did not consider whether the changes were correct). Each of the following KB changes counts as a single operation when computing the KA rate: creating a new term (class or instance), deleting a term (class or instance), changing a slot value, and removing a slot value. The KA rate also includes automatic KB changes from this list that Protégé makes based on the user's actions. In practice, only the HPKB tab generated automatic changes, such as filling in the unit type and specialty, changing the name of a unit to a meaningful one, and so on.
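As an illustration of the metric (assuming a hypothetical log format, since the paper does not describe the structure of the Protégé logs), the following Java sketch counts each logged KB change as one operation and divides by the elapsed time.

```java
import java.util.List;

// Sketch of computing the KA rate from logged KB changes; each change
// (creating or deleting a term, changing or removing a slot value) counts
// as one operation, including automatic changes the tool makes. The
// LogEntry record is a hypothetical stand-in for the actual log format.
public class KaRateExample {

    record LogEntry(long timestampMillis, String operation) {}

    // KB changes per minute over the span of the logged task.
    static double changesPerMinute(List<LogEntry> log) {
        if (log.size() < 2) return 0.0;
        double minutes = (log.get(log.size() - 1).timestampMillis()
                        - log.get(0).timestampMillis()) / 60000.0;
        return minutes > 0 ? log.size() / minutes : 0.0;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
            new LogEntry(0L, "create-instance"),
            new LogEntry(30_000L, "set-slot-value"),
            new LogEntry(60_000L, "set-slot-value"));
        System.out.println(changesPerMinute(log)); // 3 changes over 1 minute = 3.0
    }
}
```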
4. Results
4.1. Rate of knowledge acquisition

In Figure 5 we compare the KA rate for the subjects using the HPKB tab and the generic version of Protégé-2000. The graph shows the average value for all days and all subjects for the first six tasks. (The graph does not include task 7 because that task did not involve the HPKB tab, and the subjects performed it only on the day on which they were using the generic Protégé version.) For all tasks except one, the subjects who were using the enhanced user interface were able to acquire knowledge faster than the subjects who were using the basic Protégé-2000 system. The only task in which the results were the opposite was task 6, the only task during which the subjects simply entered data and did not need to study the knowledge already in the knowledge base. This result indicates that the HPKB tab is beneficial for analyzing the contents already in the knowledge base, but is not as useful when entering new data.
Figure 5. Average knowledge-acquisition rate (KB changes/min) for all days and all subjects, by task (tasks 1-6 and the average), for the HPKB tab and Protégé-2000

Figure 6 contains the KA-rate data for all three days of the experiment: days one and two of the main experiment and day three of the skill-retention experiment. When the subjects moved from Protégé-2000 (the tool that is harder to use) to the HPKB tab (the enhanced user interface, which is easier to use), their KA rate almost doubled. On the other hand, the subjects who first became familiar with the data and with many aspects of the system by using the easier tool (the HPKB tab) did not slow down on the second day, when they had to use the more complicated tool (Protégé-2000).
Figure 6. Knowledge-acquisition rate (KB changes/min) for each group on each of the three days (day 3 is 1-2 weeks later). Group 1 progressed from Protégé-2000 (day 1) to the HPKB tab (days 2 and 3); Group 2 progressed from the HPKB tab (day 1) to Protégé-2000 (days 2 and 3).
In the skill-retention experiment, the KA rate for the subjects in one group did go down after a break of one to two weeks, but they still performed better than they did on the first day of using the system (see Figure 6). The KA rate for the second group remained virtually unchanged after the break.
4.2. Ability to find errors

For several tasks, we intentionally introduced errors into the source knowledge bases and asked the subjects to find and fix them. Since we knew how many errors, and which errors, the knowledge base contained at the beginning of the task, we could measure the fraction of those errors that the subjects were able to find. These tasks tested the subjects' ability to navigate around the knowledge base—recall that the knowledge base had around 1500 interconnected frames, and the subjects had to navigate through it to find the areas where the errors could be and correct them. Figure 7 shows the results.
Figure 7. Ability to find errors: fraction of errors found (errors found/total errors) on days 1-3 (tasks 1-5, subjects 1-4) for the HPKB tab and Protégé-2000
On the first day, with minimal experience with the tool and little familiarity with the knowledge base and its organization, the subjects using the HPKB tab found 90% of the errors; the subjects using the generic version of Protégé-2000 were able to locate 81% of the errors.
4.3. Quality of knowledge entry
4.3.1. Quality of the final result
In order to evaluate the quality of the knowledge entered, we studied the knowledge bases that the subjects produced. We used the manual on the organization of opposing-force units as the "gold standard" to determine correctness. Even though the manual contained outdated and incomplete information, for our purposes it provided an acceptable "gold standard": we asked the subjects to assume that the information in the manual was indeed correct. If they could work with the information currently in the manual, they could also work with information presented in the same form in other sources. Figure 8 shows the correctness results for the task that involved entering a large amount of data (from 70 to 110 new terms or relations). The correctness rate ranged from 55% to 92% on the first day. However, on the second day and 1-2 weeks later, the correctness rate was uniformly above 97%.
4.3.2. Errors during task execution
We measured the proportion of wrong KB changes for all tasks and all subjects. The number of wrong KB changes includes the errors that the subjects later recovered from. Only 7% of the subjects' steps were wrong. We believe that one of the main advantages of Protégé in general, and of the HPKB tab in particular, is the extremely low error rate of knowledge acquisition. Here Protégé benefits the most from structured knowledge acquisition.
4.3.3. Error recovery

We discussed earlier (see Section 4.2) how successful the users were in finding errors in the portions of the knowledge base to which the tasks guided them. Another type of error correction happens during the KA process: when users enter data, they make mistakes. Sometimes the users notice their own mistakes and attempt to recover from them. On average, the subjects noticed and recovered from 32% of the errors they made along the way. That is, of the 7% of wrong steps mentioned earlier, the subjects later recovered from almost one third.
Figure 8. Quality of the resulting knowledge base: error rate for each of the four subjects on day 1 (108 entries), day 2 (78 entries), and day 3 (70 entries)

4.4. Skill retention

We have referred to the results of the skill-retention experiment throughout this section. To summarize, the experiment demonstrated that the subjects were able to retain their skills
after a break of one to two weeks, with no additional training after the break and with no representatives of the research staff available to answer their questions: they performed the tasks in their offices, without the ability to ask for help with the tasks or the tool. The KA rate remained virtually unchanged for one of the subject groups and was still better than the first-day KA rate for the other group. The quality of the knowledge entry did not deteriorate at all. The subjects were able to find between 72% and 82% of the errors that we introduced into the knowledge base, and the data that they entered were more than 97% correct.
5. Discussion
5.1. Testing the hypotheses

The main result of the experiment was the demonstration that domain experts with limited computer experience and no artificial-intelligence or general computer-science knowledge were able to use both Protégé-2000 and the enhanced domain-specific version to enter large amounts of complicated, highly interconnected data. In addition, they were easily able to find and correct errors in a very large knowledge base. They were able to do so after extremely short training sessions (90 minutes each morning, before starting a set of exercises). Moreover, they were able to retain the skills they acquired during the first phase of the experiment and to perform a similar set of tasks a week or two later, with little loss in productivity and no loss in the quality of the knowledge they entered. The results demonstrated that domain-specific user interfaces are useful, particularly at the earlier stages of using the tools. Our results show that the knowledge-acquisition rate
is higher with enhanced domain-specific user interfaces, such as the HPKB tab. The difference in the knowledge-acquisition rate was particularly large on the first day of using the tool. Therefore, if KA tools are intended for occasional short-term use, adding extra features that better reflect the nature of the domain would seem to be a worthwhile investment. Once the subjects became familiar with the knowledge base and the tool, regardless of which tool it was, they were able to perform almost perfectly on a complicated task involving many types of highly interrelated information.

The effect of the enhanced user interface on the quality of knowledge entry was much less dramatic. Since Protégé-2000 itself provides a forms-driven, structured knowledge-acquisition facility, knowledge entered using Protégé-2000 is usually of very high quality. The forms "lead" users through the necessary steps, for example, making sure that they enter only data of the allowed types and giving immediate feedback on the data that the user just entered.

Our study was also limited by the resources that were available. Four subjects performed the experiments. We would have liked to have more subjects and to be able to randomize the order in which the tests were given. If we had had more subjects, we would have created explicit control groups and compared performance among these groups, rather than using experimental groups as their own controls. Each group of subjects would then have worked with a single tool, minimizing any potential learning effects that might have skewed the results. However, we assumed that, with such a small number of subjects, the effect of individual differences was a more likely source of errors, and we decided to perform the block-randomized tests.
5.2. Evaluating knowledge-acquisition techniques

Software-engineering metrics require three classes of measurements for any experiment (Fenton 1991):

1. Product measurement: in the case of knowledge-based systems, the product is the resulting knowledge base itself.

2. Process measurement: in the case of a knowledge-acquisition system, such metrics as the KA rate, wrong and correct steps or knowledge-base changes, the error-recovery rate, and so on.

3. Resource measurement: the subjects, personnel, and so on, that were required for the study.

The statistical and measurement requirements are particularly hard to satisfy for knowledge-acquisition systems, in contrast to general software products (Shadbolt et al. 1999). Obtaining a sufficiently large and diverse sample of subjects with similar backgrounds requires significant resources. Getting subjects for KA experiments is even harder when the tool is geared toward professionals in fields other than knowledge management, who may be less motivated to test the tools. Assessing the quality of the final product (product measurement) means assessing the quality of knowledge. Menzies (1999) gives an overview of several experiments to evaluate knowledge acquisition that have been performed in recent years. He points out that none of the KA-system evaluations performed so far (such as the Oak Ridge study (Barstow et al. 1983), the HPKB project (Cohen et al. 1999), or the Sisyphus-VT project (Schreiber and Birmingham 1996)) provides a general framework for evaluating the quality of
the knowledge that was produced in the knowledge-acquisition process. The approach that we proposed in this paper is a study with a gold standard to use as a benchmark, ablation of parts of the KA system to factor out the value of those parts, and block randomization to factor out individual differences.
6. Conclusions

We designed a controlled study to evaluate the usability of a knowledge-acquisition system. We demonstrated the ability of military experts—who had no experience in knowledge acquisition or computer science—to use our tools to acquire knowledge quickly and reliably. The experiment was an ablation experiment that measured the impact of enhanced user interfaces on the rate and quality of knowledge acquisition. We measured how well the subjects retained their skills in using the tools by giving them a new set of tasks one to two weeks after the initial experiment. The experiment shows that detailed, measurable experiments on artificial-intelligence systems in general, and on knowledge-acquisition systems in particular, are possible and produce quantifiable results that can be compared with the results of other experiments. Our results document the ability of subject-matter experts to work independently on a complex knowledge-entry task, and they highlight the importance of an effective user interface in enhancing the knowledge-acquisition process.
Acknowledgments We thank the Battle Command Battle Lab for organizing the experiments, providing the test users, and handling the logistics. We are very grateful to Ray Fergerson for readily
extending Protégé with features that were required for the experiments. We are indebted to Mike Hewitt for participating in the pilot experiment. This work was funded in part by the High Performance Knowledge Base Project of the Defense Advanced Research Projects Agency.
References

Barstow, D., Aiello, N., et al. (1983). Languages and tools for knowledge engineering. In: F. Hayes-Roth, D. Waterman and D. Lenat (eds.), Building Expert Systems. Addison-Wesley: 283-345.

Cohen, P., Schrag, R., Jones, E., Pease, A., Lin, A., Starr, B., Gunning, D. and Burke, M. (1999). The DARPA High-Performance Knowledge Bases Project. AI Magazine 19(4): 25-49.

Cohen, P.R. (1995). Empirical Methods for Artificial Intelligence. Cambridge, MA: MIT Press.

Duineveld, A.J., Stoter, R., Weiden, M.R., Kenepa, B. and Benjamins, V.R. (1999). Wondertools? A comparative study of ontological engineering tools. In: Proceedings of the Twelfth Banff Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Alberta.

Fenton, N.E. (1991). Software Metrics. London: Chapman and Hall.

Gennari, J.H., Cheng, H., Altman, R.B. and Musen, M.A. (1998). Reuse, CORBA, and knowledge-based systems. International Journal of Human-Computer Studies 49(4): 523-546.

Grosso, W.E., Eriksson, H., Fergerson, R.W., Gennari, J.H., Tu, S.W. and Musen, M.A. (1999). Knowledge modeling at the millennium (the design and evolution of Protégé-2000). In: Proceedings of the Twelfth Banff Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Alberta.

Menzies, T. (1999). hQkb—the High Quality Knowledge Base Initiative (Sisyphus V: Learning Design Assessment Knowledge). In: Proceedings of the Twelfth Banff Workshop on Knowledge Acquisition, Modeling, and Management, Banff, Alberta.

Menzies, T.J. (1999). Knowledge maintenance: the state of the art. The Knowledge Engineering Review 14(1): 1-46.

Musen, M.A., Tu, S.W., Eriksson, H., Gennari, J.H. and Puerta, A.R. (1993). Protégé-II: an environment for reusable problem-solving methods and domain ontologies. Technical Report 93-491, Stanford Medical Informatics (SMI), Stanford University.

Nielsen, J. (1994). Usability Engineering. Academic Press/Morgan Kaufmann.

Rubin, J. (1994). Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. John Wiley & Sons.

Schreiber, A.T. and Birmingham, W.P. (1996). Editorial: the Sisyphus-VT initiative. International Journal of Human-Computer Studies 44(3/4): 275-280.

Shadbolt, N., O'Hara, K. and Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions. International Journal of Human-Computer Studies (special issue on evaluation of KA techniques) 51(4): 729-755.