Use of the CANTOR system for collaborative learning in medical visual object recognition

Hans H. K. Andersen (a), Verner Andersen (b), Birgit G. Skov (c)

(a, b) Risø National Laboratory, System Analysis Department, Roskilde, Denmark
(c) Gentofte University Hospital (GUH), Department of Pathology, Gentofte, Denmark
ABSTRACT

This paper reports on a user requirement, design, and evaluation study on supporting collaborative learning by visual perception in the medical education domain. The user requirements were established using the Cognitive Systems Engineering approach and laid the foundation for the design of the software system CANTOR (Converging Agreement by Networking Telematics for Object Recognition), which can briefly be described as a tool that supports collaborative consensus making when classifying sets of medical images or objects in medical images. In our evaluation experiments we have focused on the application of CANTOR in the early learning and training phase, assessing qualitatively and quantitatively whether CANTOR can be used as a collaborative training tool. The experiments showed that using CANTOR seems to give a better learning effect than traditional methods.

Keywords: Visual Learning, Evaluation, Collaborative Medical Classification, Inter- and Intra-observer Variability.
INTRODUCTION

As in many other domains, learning in medicine is a life-long process. Specializing in pathology, for example, can take up to ten years. Standardization of the learning processes is also needed to ensure a standardized, high-quality output of the medical work. In this paper we focus on the collaborative processes involved in learning to recognize, and to create consensus with respect to classifying, visual objects in medical images. Traditionally, many of these processes have been of the master/student type: the student learns how to classify under the close supervision of an expert. This is an effective but costly educational activity, and the level of expertise available may vary from place to place. The question is how better to support the collaborative learning processes and standardize the level of expertise within a group of students through training using the same system. Within CSCL, one of the important research questions is how to specify important design considerations when developing collaborative learning systems and how to implement such specifications (Koschmann, 1996). In this paper we suggest using a Cognitive Systems Engineering approach to meet these challenges. In particular, we suggest that the users should be directly involved in specifying and evaluating their requirements for a collaborative learning tool.
                  Same place                  Different places
Same time         face-to-face interaction    synchronous distributed interaction
Different times   asynchronous interaction    asynchronous distributed interaction

Table 1: An example of the 2x2 time and space matrix introduced by Johansen (1988) for categorizing CSCW applications.

The CANTOR system supports collaborative consensus making by letting a group of students view, share, compare, rank, and finally join individual and/or mutual classification results. In this way the system stimulates learners to be more effective in learning and in applying learning strategies. CANTOR is based on the idea of self-regulated learning (Schunk, 1989; Zimmerman, 1986). The CSCL field shares many resemblances with that of CSCW (Computer Supported Cooperative Work). In the literature, CSCW applications have often been categorized according to a 2x2 time and space matrix introduced by Johansen (1988). According to this type of categorization, CSCW applications can be conceived as enhancing either real-time communication and collaboration or asynchronous interactions. Furthermore, CSCW applications can be categorized as to whether they support actors engaged in face-to-face interactions or actors distributed over many locations. According to the matrix in Table 1, the CANTOR system supports asynchronous distributed collaborative learning.
THE LEARNING PROBLEM IN MEDICAL IMAGING

In medical imaging, visual object recognition relies on subjective human visual perception. This process suffers from a significant inter- and intra-observer variability that has been widely studied in the literature (Ackerman et al., 1996; Sorensen et al., 1993; Corona et al., 1996; Nedergaard et al., 1995). It is known that the concordance between pathologists shows a large variation, even between expert pathologists. Kappa scores in studies range from 0.5 for many tumor categories to 0.9 for malignancies of the lymphatic system (kappa is a measure of agreement in image recognition that excludes agreement by chance). The latter high score is, however, reached using additional techniques such as immunohistochemistry and genetic analysis apart from HE-stained sections. In practice, the situation is not as dramatic as these figures suggest, because most discrepancies probably occur in sub-categories, and the clinical consequences of these sub-divisions are often not yet fully determined. However, this could change as a consequence of clinico-pathological research showing differences in the biological behavior of sub-types, leading to refinement of therapeutic modalities. Discrepancies can have many causes. One of them is the lack of uniform criteria. In addition, the correct identification of features and the application of established criteria may be difficult as well. The lack of uniform criteria can be tackled by consensus formation between expert pathologists followed by thorough clinical follow-up. But it is difficult to make consensus activities efficient, mainly due to managing the distribution of slides, the speed of the process, and the multiplicity and complexity of the criteria. The lack of uniform identification and application of the (established) criteria can be overcome by education and training under the guidance of expert pathologists. One of the main purposes of CANTOR is to support this process in local environments as well as in a network application.
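For reference, the kappa statistic in its common two-rater (Cohen) form is the observed agreement corrected for the agreement expected by chance:

\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where \(p_o\) is the observed proportion of agreeing classifications and \(p_e\) the proportion expected by chance from the raters' label frequencies. For example, if two pathologists agree on 80 of 100 cases (\(p_o = 0.80\)) and their label frequencies give \(p_e = 0.50\), then \(\kappa = (0.80 - 0.50)/(1 - 0.50) = 0.60\).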
METHOD

Elicitation of User Requirements

The methodological approach applied in the user requirement study has mainly followed the principles and concepts offered by the Cognitive Systems Engineering (CSE) approach developed at Risø National Laboratory (Rasmussen, 1988; Rasmussen et al., 1994). It allows the work analyst to analyze a work system in terms of means-ends relationships indicating the why, what, and how relations among the various layers of the hierarchy. This ensured that the analysis focused not only on the results of classifying morphological features, but also on revealing the underlying arguments for the classifications. The methods applied in the user requirement study all belong to the qualitative area of research:

- Interviews (qualitative, semi-structured, unstructured)
- Document inspection (worksheet reports, standards, quality assessment schemes, handbooks, laboratory manuals, classification lists, diagrams, drawings, etc.)
- Observations (activities at the microscope, presentations of labs)

The elicitation of the user requirements was carried out during a period of 6 weeks, visiting seven hospitals. 15 persons were interviewed (length of interviews 6-9 hours). All interviews were tape-recorded, notes were taken during the interviews, and all tapes have been transcribed. The interviews were pre-planned in terms of time, place, and the nature of the questions to be asked. Although we had lists of questions, these were seldom followed strictly. In some cases a list helped us keep an overview of the course of the interview, for example to avoid too many guiding questions and to incorporate a number of 'checkpoints' in terms of summaries. The form of the interviews thus ranged from semi-structured to rather unstructured, almost like a conversation. The CSE means-ends abstraction hierarchy has been utilized as an analytic tool for formulating the user requirements at three levels: strategic, procedural, and operational. The strategic requirements mirror the goals, purposes, and constraints governing the interaction between the medical work system under consideration (e.g. cancer diagnosis) and its environment, for example the quality of the
histopathological diagnoses. In addition, the strategic requirements represent the concepts necessary for setting priorities, such as quality of service and categories of diseases with respect to the diagnosis. The procedural requirements characterize the general functions and activities, such as classifying auto-immune sera based on pattern recognition of entire images, diagnosing cancers by recognizing and combining distinct morphological features, or contributing to the science and practice of diagnostic histopathology. The operational requirements represent the physical activities, such as the use of tools and equipment. Furthermore, the operational requirements signify the physical processes of the equipment and the appearance and configuration of material objects, such as staining, clinical information, and the multi-head microscopes of the traditional set-up, now replaced by presentation of the images and related information supported by various software tools on the PC screen. Feedback from the end-users, not only at the end of the project but also during the evolution of the prototype, ensured an interactive and iterative development process.
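As a purely illustrative sketch (the structure and entries below paraphrase the examples just given; the encoding is ours, not part of CANTOR), the three requirement levels could be recorded as:

```python
# Illustrative only: the three levels of the CSE means-ends analysis,
# with entries paraphrasing the examples given in the text above.
user_requirements = {
    "strategic": [    # why: goals, purposes, and constraints
        "quality of service of the diagnostic work",
        "prioritization across categories of diseases",
    ],
    "procedural": [   # what: general functions and activities
        "classify auto-immune sera by pattern recognition of entire images",
        "diagnose cancers by recognizing and combining morphological features",
    ],
    "operational": [  # how: physical activities, tools, and equipment
        "staining and handling of clinical information",
        "image presentation on the PC screen (replacing multi-head microscopes)",
    ],
}
```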
FUNCTIONALITY

Autoimmune Serology Domain

The topic focused upon in the autoimmune serology domain is the diagnosis of chronic immunoinflammatory connective tissue diseases, an area that is both clinically and paraclinically complicated. The clinical immunological laboratory has a very important role in helping the clinicians reach a correct diagnosis and estimate the prognosis of a given disease. Both missing a correct diagnosis early in the clinical work-up and presenting a false positive result to the clinician may lead to serious consequences for the patient, increased costs for the health care and social security systems, and a serious reduction in the quality of life of the patients and their families. Therefore, it is mandatory to optimize diagnostic testing by using one common nomenclature for positive results, supported by unequivocal definitions, and by the possibility of relating terms and images to reference images in a database. Three centers participated in reaching agreement on the taxonomy for the domain as well as on a reference classification of positive and negative staining patterns representing positive and negative reactions. Two image databases representing the same categories were constructed for baseline knowledge testing, training, and final examination (certification). The database for baseline knowledge testing and final examination consists of 40 distinct single-pattern image sets (one image at ×250 and one at ×400 for each case). The database for training consists of 45 distinct image sets. These 85 (40+45) image sets differ from the images that were used during the construction of the taxonomy for the AIS domain and that were attached to that taxonomy as reference images. Figure 1 shows an example of a tool supporting the verification of a classification by allowing the user to compare the classified image with one corresponding to the selected classification.
Figure 1: Screen shot of the "Matcher" tool of CANTOR. The classification types (taxonomy) are indicated in the scrollable text at the right-hand side.

Cancer Histopathology

Three sets of lesions from different cancer sub-domains, each consisting of cases represented by their images, were developed. The number of cases in each sub-domain is: 42 lung tumors (LT), 40 soft tissue tumors (STT), and 47 melanocytic lesions (MEL). On average, approximately 2, 7, and 10 images were used per case in the respective sub-domains; this relates to the number of images initially considered sufficient for a diagnosis. Three participating laboratories provided one set of lesions each (LT, STT, and MEL), together with a classification set (a set of answers) to be considered the "expert" (reference) classification set.
These diagnoses were made and/or verified, using all the available original glass slides, by pathologists considered to be at least the regional expert in their respective areas. The taxonomies for the LT and STT sub-domains are based on the World Health Organization classification of the tumors; the taxonomy for the MEL sub-domain was made by one of the expert participants. The numbers of labels in the taxonomies are: LT: 109, STT: 226, and MEL: 74. However, in the MEL sub-domain our cases covered only 5 of these 74 categories; this was done to evaluate the software in the area of distinguishing a benign from a malignant lesion with overlapping characteristics. All participants in the classification rounds were informed of this before they started to classify. All participants in this part were experienced pathologists, although in different categories of lesions. Each participant classified the lesions/cases in each database at most 3 times over a period of approximately six weeks. These classifications were stored in classification sets, one for each sub-domain, so each participant produced a maximum of 9 classification sets. After each round, the coordinator changed the presentation of the images as much as the software permitted in order to limit the memory effect. The modified databases were sent to the participants at the beginning of each round by the coordinator, who thereby set the pace of the study. A standard naming convention was used for the databases as well as for the classification sets. After the second round, the expert classification sets ("answers") were made available to all participants. The classification sets of the participating pathologists were compared with the expert classification set after each of the 3 classification rounds to test for learning effects (a sketch of such a comparison is given below).
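The paper does not detail how the classification sets were compared; as a minimal sketch, a two-rater kappa between a participant's classification set and the expert set could be computed as follows (function and variable names are hypothetical, not CANTOR's):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Two-rater Cohen's kappa over parallel lists of case labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed proportion of agreeing classifications.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Hypothetical usage: compare one participant's round against the expert set.
expert      = ["squamous", "adeno", "small-cell", "adeno"]
participant = ["squamous", "adeno", "adeno",      "adeno"]
print(round(cohens_kappa(expert, participant), 2))  # 0.56
```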
Figure 2 shows a screen dump of the classification window for diagnosing a potential lung tumor, including annotations and the lung tumor taxonomy.

Figure 2. The CANTOR classification window and taxonomy for lung tumor diagnosis.

Experiment Comparing Learning Utilizing CANTOR and Learning from Textbooks

The objective of this experiment was to obtain a qualitative assessment of the usability of CANTOR for the learning of 'students' within the domain of lung cancer histopathology. 'Introductory doctors' from Denmark participated in the experiment. Introductory doctors are fully educated physicians who want to get a first insight into the histopathology domain and may eventually decide to specialize in the area and start their career towards becoming experts in the lung cancer domain. Ten physicians were at this stage of their education in Denmark, and all of them were contacted; all agreed to participate. Unfortunately, due to specific weather conditions, only 6 of them appeared for the experiment, which was nevertheless performed as planned. However, the small number of participants prohibits a valid statistical comparison, so any differences seen in the results should be interpreted as a qualitative rather than a quantitative indication of the benefit of CANTOR as a learning and training tool. It is current practice to use the WHO booklet to learn what the different morphological features look like: the student can inspect sections of the book and compare the pictures with the microscopic image of the tissue to be classified. An expert from GUH briefly introduced all 6 participants to the experiment from the medical point of view. We then distributed the forms to be used for filling in the diagnoses and gave the participants some time to get familiar with the classification structure and the form. They then individually diagnosed 30 cases of lung cancer presented as slides, as a baseline test of their initial skill. The doctors had 1-2 minutes to decide each diagnosis. In some cases the expert gave a small piece of information, e.g. on the preparation type. After this introduction they were split into two groups and allowed about two hours of self-training, one group using the CANTOR system and the other group using the standard WHO textbook 'Histological Typing of Lung and Pleural Tumors' (third edition, 1999).
Following this session, they all diagnosed 31 lung cancer cases, once again presented as slides. This time 15 new cases were introduced; the remaining 16 images were mirrored and/or turned upside down and shuffled with the 15 new ones to reduce visual recognition. The 15 new cases were related in diagnosis to the cases they replaced. The improvement or deterioration for each participant was tested by comparing the new success rate with the baseline results. Since the two groups were small and the number of images shown limited, only qualitative results were obtained. The physicians' perspective was assessed by means of a questionnaire that has been used and validated in various settings; the results provide a quantitative measure of usability. The written comments also provided valuable feedback on features that could improve the usability of CANTOR. The organizational structure of the session is shown in Figure 3. We first gave the group an overall introduction to the CANTOR project, i.e. what the software is meant to support and the context for using it. Moreover, we stressed that it was the usability of the software we wanted to test during the trial, not the performance of the people engaged in it. Furthermore, we gave a presentation of the software from a technical point of view, presenting the functions the physicians were going to use during the day. We showed how to use "view classifications", using an expert classification set for educational browsing, just like an ordinary book. For training we showed them "guided classification", which gives immediate feedback from the expert during classification. Then we showed how to use "manual classification" to classify on their own without guidance. Last we showed them "visual comparison" for comparing their completed classification set against the expert set.
[Figure 3: flow diagram of the experimental design. The group of introductory doctors receives a presentation of CANTOR and the medical background, then classifies 30 cases as a baseline; the group is split into Group 1 (WHO Histological Typing textbook) and Group 2 (Extended Doors / CANTOR); both groups then classify 31 cases, followed by the evaluation.]

Figure 3. Presentation of the experimental design.
EVALUATION

Autoimmune Serology Domain

Table 2 indicates the results of learning and training using CANTOR within the domain of autoimmune serology. To be included in the comparison study, all participants of the session had to follow a specific set of rules for the learning and training phase.
Participant   Initial status   Baseline   Certification
A             Experienced      1.00       1.00
B             Expert           1.00       1.00
C             Expert           1.00       1.00
D             Experienced      0.97       0.97
E             Expert           0.95       0.95
F             Experienced      0.92       0.95
G             Experienced      0.90       0.97
H             Novice           0.90       0.92
I             Experienced      0.90       1.00
J             Novice           0.87       0.87
K             Experienced      0.84       0.90
L             Novice           0.74       0.95
Table 2: Classification results for the 12 participants who met our requirements for inclusion in the study. The threshold for expert status is normally indicated by κ > 0.95.

Three persons were experts before the training and did not improve their performance; two of them had no room for improvement. Six persons were experienced, two of them showing expert-level performance already from the start, whereas the remaining four improved significantly; three of them even reached or exceeded the level of expertise. Of the three novices, one did not improve, one improved slightly, and the last improved from a rather low level to the level of expertise during the training period of about five weeks. These participants illustrate the learning and training potential of CANTOR. The results also show that novices differ in their capability to grasp image interpretation.

Cancer Histopathology

The results from both cancer histopathology sessions were the starting point for discussions of the reasons for the variation and of the possibilities to limit it in order to reach consensus, standardization, and higher quality. Discussion points could then be: are the variations caused by a lack of uniform criteria, so that consensus formation (standardization) in the group is needed? Other aspects to consider are the quality of the images, the lack of representative images, and a lack of familiarity with the lesions, calling for additional training. An additional problem in the three cancer histopathology sub-domains is the large number of categories one has to deal with. This number is much larger than in the AIS domain and will inevitably leave more room for discrepancies. Lastly, one has to take into account that agreement on the diagnosis is required only up to the level where it still makes a difference in, e.g., treatment or prognosis; a more detailed classification beyond relevance for the treatment is to some extent an academic exercise. It is considered that future versions of CANTOR should allow the calculation of kappa values and degrees of agreement at various levels in the taxonomy, and not necessarily at the leaf nodes only (see the sketch below).
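As a minimal sketch of such taxonomy-level agreement (a hypothetical encoding; CANTOR itself does not yet provide this), hierarchical labels can be truncated to a chosen depth before computing agreement:

```python
def truncate_to_level(label_path, level):
    """Cut a hierarchical taxonomy label (tuple of nodes) at a given depth."""
    return label_path[:level]

def agreement_at_level(set_a, set_b, level):
    """Raw proportion of cases whose labels match down to `level`."""
    return sum(truncate_to_level(a, level) == truncate_to_level(b, level)
               for a, b in zip(set_a, set_b)) / len(set_a)

# Hypothetical labels: (category, sub-type) paths in a lung tumor taxonomy.
expert      = [("carcinoma", "squamous"), ("carcinoma", "adeno")]
participant = [("carcinoma", "adeno"),    ("carcinoma", "adeno")]
print(agreement_at_level(expert, participant, level=1))  # 1.0 at category level
print(agreement_at_level(expert, participant, level=2))  # 0.5 at sub-type level
```

The raw proportion shown here could equally be replaced by the kappa computation sketched earlier, applied to the truncated labels.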
Table 3 indicates the improvement in the performance of the pathologists from the initial phase of diagnosing to the final round after learning and training. The poor starting performance is due to the fact that the pathologists had to diagnose in a sub-domain not covered by their expertise. However, after the education round, combined with their overall expertise, the performance improved markedly in most cases.

Sub-domain   Classifier   Kappa round 1 (before release of "expert" set)   Kappa round 3 (after release of "expert" set)
LT           A            0.85                                             1.00
LT           B            0.80                                             0.95
LT           C            0.51                                             0.92
LT           D            0.22                                             0.29
MEL          B            0.00                                             1.00
MEL          E            0.02                                             0.29
MEL          F            0.00                                             0.63
STT          B            0.62                                             0.87
STT          C            0.05                                             0.74
STT          E            0.27                                             0.74
STT          F            0.29                                             0.92
Table 3: The improvement of the kappa value between round 1 and round 3. The major improvement occurred after the release of the "expert" set between rounds 2 and 3, indicating the ability of CANTOR to detect a possible learning effect.

Results from Comparing Learning Utilizing CANTOR and Learning from Textbooks

In general, the experiment indicates that CANTOR has at least the same educational and training effect as textbooks. The scores provided by the students on the usability questionnaire indicated that the components of the CANTOR software for classifying (objects in) images, for comparing classifications, and for inspecting differences were well appreciated. Following the session, the group of physicians who had not used the CANTOR system were introduced to it by their colleagues, who had themselves used the system for only a couple of hours. This introduction, and the fact that all six physicians were afterwards able to benefit from the system by making diagnoses and self-checking them using the CANTOR tools, indicated the user friendliness of the system. Setting up an application with the software, however, requires some skill and specific approaches to obtain the required set-up and to avoid program shutdowns. Suggestions were given on which aspects of the software may need attention to further improve its stability and user friendliness from an administrative point of view. The number of correct diagnoses for each group is shown in Figure 4. Figure 5 shows the correct diagnoses at person level, approximated with normal distributions with respect to average and deviation.
[Figure 4: bar chart of the number of correct diagnoses per group. Doors (CANTOR) group: 17 pre-training, 46 post-training; Textbook group: 25 pre-training, 43 post-training.]
Figure 4. The number of correct pre-training and post-training diagnoses for each group as compared to 90 and 93 classifications, respectively.
[Figure 5: normal-distribution approximations of correct diagnoses per person. CANTOR pre-training: mean = 5.67, var = 1.44; CANTOR post-training: mean = 15.33, var = 1.19; Textbook pre-training: mean = 8.33, var = 2.12; Textbook post-training: mean = 14.33, var = 2.68.]
Figure 5. The average number and variation of correct pre-training and post-training diagnoses at person level, approximated with normal distributions based on 30 and 31 classifications, respectively. The scaling of the curves is arbitrary.

The result (still to be seen only as a qualitative indication) suggests that the CANTOR system is at least as good as, and maybe even better than, the textbook as an education tool. If the increase in the mean value of correct diagnoses, relative to the average of the variations before and after the training, is taken as a measure of improvement, this value is about three times larger for the CANTOR training than for the textbook training (see the worked check below). However, this is strongly influenced by the spread in the performance of the trainees, which for the textbook trainees happens to be nearly twice that of the CANTOR trainees. Furthermore, as indicated before, the two groups are too small to draw any statistically significant quantitative conclusions from the results. The session was concluded by a discussion concerning the usability and user friendliness of the CANTOR system for education and training in cancer diagnosis. Except for minor suggestions related to the user interface, the general opinion of the participating physicians was very positive. They found the system not only valuable but also inspiring, due to the tools allowing direct feedback on their performance as compared to the expert opinion and an objective indication of personal improvement through the kappa value.
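The factor of about three can be checked from the Figure 5 parameters, assuming 'variations' refers to the variance values listed there:

\[
\frac{15.33 - 5.67}{(1.44 + 1.19)/2} \approx \frac{9.66}{1.32} \approx 7.3 \quad (\text{CANTOR}), \qquad
\frac{14.33 - 8.33}{(2.12 + 2.68)/2} = \frac{6.00}{2.40} = 2.5 \quad (\text{textbook}),
\]

giving a ratio of \(7.3/2.5 \approx 2.9\), i.e. about three.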
CONCLUSION

The experiments have shown that CANTOR is a valuable tool for learning and training. Using CANTOR seems to give a better learning effect than using textbooks. Our studies are, however, limited, since only measurements taken directly after training have been made; whether using CANTOR has a longer-lasting effect than studying from books needs further investigation. Furthermore, using the CANTOR system in the learning and training of medical staff in autoimmune serology or pathological diagnosis will be more cost effective, as the computer-supported collaborative learning effect replaces, to a high degree, the need for the presence of real experts. Indeed, when facing a difficult diagnosis, a young, isolated pathologist may benefit greatly from this database as well as from the consolidation of the standards for disease classification.
ACKNOWLEDGMENT

This work has in part been funded by the European Commission project CANTOR, Telematics Healthcare, HC4003.
REFERENCES

Ackerman, A. B., et al. (1996). Discordance among expert pathologists in diagnosis of melanocytic neoplasms. Human Pathology, 27(11), 1115-1116.

Corona, R., Mele, A., Amini, M., De Rosa, M., Coppola, G., Piccardi, P., Fucci, M., Pasquini, P., & Faraggiana, T. (1996). Interobserver variability on the histopathologic diagnosis of cutaneous melanoma and other pigmented skin lesions. Journal of Clinical Oncology.

Johansen, R. (1988). Groupware: Computer Support for Business Teams. New York & London: The Free Press.

Koschmann, T. D. (ed.) (1996). CSCL: Theory and Practice of an Emerging Paradigm. Mahwah, NJ: Lawrence Erlbaum Associates, 1-23.

Nedergaard, L., Jacobsen, M., & Andersen, J. E. (1995). Interobserver agreement for tumour type, grade of differentiation and stage in endometrial carcinomas. APMIS, 103(7-8), 511-518.

Rasmussen, J. (1988). A cognitive engineering approach to the modeling of decision making and its organization in process control, emergency management, CAD/CAM, office systems, library systems. In W. B. Rouse (ed.), Advances in Man-Machine Systems Research, vol. 4. Greenwich, CT: JAI Press, 165-243.

Rasmussen, J., Pejtersen, A. M., & Goodstein, L. P. (1994). Cognitive Systems Engineering. New York: John Wiley.

Scardamalia, M., Bereiter, C., McLean, R. S., Swallow, J., & Woodruff, E. (1989). Computer-supported intentional learning environments. Journal of Educational Computing Research, 5, 51-68.

Schunk, D. H. (1989). Social-cognitive theory and self-regulated learning. In B. J. Zimmerman & D. H. Schunk (eds.), Self-Regulated Learning and Academic Achievement: Theory, Research, and Practice. New York: Springer-Verlag.

Sorensen, J. B., Hirsch, F. R., Gazdar, A., & Olsen, J. E. (1993). Interobserver variability in histopathologic subtyping and grading of pulmonary adenocarcinoma. Cancer, 71(10), 2971-2976.

Zimmerman, B. J. (1986). Becoming a self-regulated learner: Which are the key subprocesses? Contemporary Educational Psychology, 11, 306-313.