Target Contour Testing/Instructional Computer Software (TaCTICS): A Novel Training and Evaluation Platform for Radiotherapy Target Delineation
Jayashree Kalpathy-Cramer, PhD1, Clifton David Fuller, MD2
1 Oregon Health & Science University, Portland, OR; 2 University of Texas Health Science Center at San Antonio, TX
Abstract
Target volume delineation is a critical step in the creation of treatment plans in radiation therapy. However, intra-observer and inter-observer variability in target volume definitions can introduce substantial differences in resulting doses between treatment plans from different users and institutions. Consequently, there is a need for tools that allow quantitative metrics to be collected and reported regarding inter- and intra-user performance in target volume delineation. We describe TaCTICS, a web-based educational training software application targeted towards residents and non-expert users. TaCTICS allows users to delineate target structures in DICOM-RT compatible formats using their preferred treatment planning system. After uploading the resulting structure file, users are provided a scoring of their structures based on comparison to reference sets derived from expert users, using a variety of metrics for volume overlap and surface distances.
Introduction
Conformal radiotherapy affords delivery of tumoricidal radiation doses to user-defined target volumes while minimizing dose to spatially adjacent non-target organs-at-risk. This precision of computer-enabled delivery allows exceptional dose-volume matching capability; nonetheless, the steep dose gradients imply that even minor geometric uncertainties may result in substantial dose deviations from intended prescriptions, which may in turn underdose tumors or overdose radiosensitive tissues. Target volumes (TV) and organs-at-risk (OARs) for treatment planning are manually defined by human users as regions of interest (ROIs), introducing possible geometric variability due to variations in gross tumor (GTV), clinical target (CTV) or internal target volume (ITV) delineation.
Despite the well-established clinical importance of accuracy in target volume delineation, inter-observer variability in target definition has been demonstrated in a series of studies, in multiple organ/anatomical sites [1-4]. Simply put, “interobserver variability in the definition of GTV and CTV is a major – for some tumor locations probably the largest – factor contributing to the global uncertainty in radiation treatment planning” [5]. In this paper, we discuss the development of a software platform, “Target Contour Testing/Instructional Computer Software (TaCTICS): A Novel Training and Evaluation Platform for Radiotherapy Target Delineation”1. This is a prototype of target delineation statistical software with a graphical user interface (GUI) that allows for near real-time data analysis and reporting of quantitative scoring metrics that compare user-derived structures with reference sets derived from expert users, specifically to allow self-evaluation for residents and other non-expert users.
Background
Conformal radiotherapy has brought the capacity to shape dose gradients in an effort to approximate three-dimensional tumors and target structures, such that tumoricidal dose may be delivered to areas at risk for neoplastic involvement, while sparing proximal non-target tissue. The planning process for conformal radiotherapy is predicated on dose calculations derived from the Hounsfield units of given voxels on a simulation DICOM image set (typically CT). Voxels within this dataset are then designated as ROIs and assigned nominal designations as GTV, clinical areas of presumed microscopic spread (CTV), or OARs. Only those volumes defined by the physician end user may be utilized to either prescribe sufficient dose to ensure tumor demise (GTV/CTV) or spare organs (OARs) through dose delivery constraints (International Commission on Radiation Units and Measurements, 1999). The ROIs are then utilized as DICOM-RT structures for dose calculation by the radiotherapy treatment planning software.
As the initial step in radiotherapy planning, target delineation becomes critical, since the most conformal plan is of reduced clinical utility if the delineated target volumes do not accurately depict actual targets or organ structures in 3D-space.
1 http://www.tacticsRT.com
AMIA 2010 Symposium Proceedings Page - 361
Since these ROIs provide the definitions for dose constraints used for treatment planning, accurate target delineation is crucial for precision radiotherapy [6]. In the pre-conformal radiotherapy era, standardized fields were utilized to ensure uniformity of treated regions. However, in the era of volume-based delineation, data suggest that considerable operator-dependent variation exists in target volume delineation and consequent dose distribution [5]. This variability complicates clinical trial quality assurance and prevents ready comparison of treatment protocols. Several groups have also sought to account for systematic variability introduced in target delineation in order to optimize volume definition and treatment planning margins. Recent survey data suggest that significant numbers of radiation oncologists receive minimal formal training in intensity-modulated radiotherapy (IMRT), and point to a need for greater research into optimum methodologies for user instruction in target delineation, among other aspects of IMRT practice. Furthermore, in recognition of the importance of proper target delineation, a host of didactic educational activities have been created to assist clinicians in developing target delineation skill sets for clinical practice. While there are software programs allowing interactive instruction (www.educase.edu, www.anatom-e.com) [7], few extant software tools or devices provide automated or semi-automated real-time instructional feedback regarding target volume delineation skill development for trainees in radiation oncology. Little extant data exist regarding how to evaluate acceptable levels of user competency in target delineation [8]. Despite great interest, comparatively little data has to date been presented regarding strategic optimization of target delineation itself, either as a function of standardized practices or as a function of deliberate educational curriculum.
Several series have also established that user variability in target volume delineation may result in potentially significant dosimetric differentials between prescribing radiation oncologists. Additionally, collected data suggest that clinical trial data may be obfuscated by user-dependent differentials in prescription volume determination. While this may be partially ameliorated by modification of study criteria with regard to volume delineation, there remains, at present, no efficient, automated mechanism to evaluate target volumes.
Consequently, there is a great need for tools that allow evaluative measures to be collected and reported regarding inter- and intra-user performance in target delineation in a DICOM-RT environment. The purpose of this effort is to develop a software application that will allow users to delineate target structure ROIs in DICOM-RT compatible formats, followed by automated comparison and scoring of user-derived ROIs against ROIs defined by reference sets derived from expert users.
Methods
The steps in creating TaCTICS consisted of:
1. Collecting a set of more than 400 structure contours from a variety of expert and non-expert users for five anatomical sites.
2. Identifying a set of meaningful quantitative metrics that can be used to compare three-dimensional volumes.
3. Creating a website and software tools that enable users to upload a DICOM-RT file containing the structures for comparison with expert-derived structures and receive an automated report with the metrics identified for comparison.
We will briefly discuss each of these steps next.
Data
These prospective IRB-exempt projects were conducted under the auspices of the University of Texas Health Science Center San Antonio Institutional Review Board. As part of two separate target delineation protocols, anonymized patient DICOM files were used to construct target delineation datasets for comparison of inter- and intra-observer target delineation variability. In each of these datasets, observers contoured the same dataset twice, albeit with either instructional or software modification [9] as a testing variable.
Dataset A consists of DICOM-RT ROIs derived from a double-blind, randomized, hypothesis-generating pilot study [9, 10] designed to test the impact of instructional modification on user-generated contours. Users were asked to contour a standardized case presentation of T3N0M0 rectal cancer twice, with half of the users randomized after the initial contouring session to receive a (then unpublished) electronic PDF of a newly developed consensus-based anatomic atlas. Results from these data have been presented previously. The study enrolled 15 radiation oncologist observers (experts and non-experts), who submitted a GTV and 2-3 CTVs for each of 2 contouring sessions, resulting in 94 distinct ROI structures available for analysis.
Dataset B consists of a series of DICOM-RT files derived from a study of human-computer user interface device (UID) modification on target volume
delineation efficiency [10]. Observers were asked to contour the stereotypic cases from several anatomical sites (representing a prostate, brain, lung, and head and neck case presentation) twice: once using a standard mouse-keyboard configuration, and once using a graphic tablet-pen interface. A total of 21 observers contoured brain, head and neck, lung, and prostatic GTV/CTV ROIs once with each UID, resulting in >400 collected ROI TV structures. For each of these sites, two users had been designated as ‘experts’ based on their experience and standing in the field.
Measures of conformity
Quantitative measures of conformity include volumetric measures of spatial overlap in 2D or 3D, as well as surface distance measures. Some of the most commonly used measures include [11]:
• Volumetric Difference (VD): computed from $V_a$ and $V_g$, where $V_a$ is the volume of the user-derived contour and $V_g$ is the volume of the “gold standard” (the expert-derived contour). However, these measures do not take into account the spatial locations of the respective volumes, and hence have limited utility when used alone.
• Dice/Jaccard: The Dice and Jaccard (or Tanimoto) coefficients are the most commonly used measures of spatial overlap for binary labels. Consider Figure 1, where A is the user contour and G is the gold standard.
Figure 1 Overlap of user and expert contour
The Dice and Jaccard coefficients are given as:
$$D(A,G) = \frac{2\,|A \cap G|}{|A| + |G|}, \qquad J(A,G) = \frac{|A \cap G|}{|A \cup G|}$$
The Dice coefficient has been shown to be a special case of the kappa coefficient, a measure commonly used to evaluate inter-rater agreement. As defined, both of these measures are symmetric. However, in situations such as contouring for radiation oncology, where the cost of missing the tumor is higher, false positive and false negative Dice measures can be used. The false positive Dice (FPD) is a measure of the voxels that are labeled positive (i.e., 1) by the user but not the expert, while the false negative Dice (FND) is a measure of the voxels that were considered positive according to the expert but missed by the user being evaluated.
• Hausdorff distance (HD): Unlike the region-based approaches given above, surface distance metrics are derived from the contours or the points that define the boundaries of the objects. The Hausdorff distance (HD) is commonly used to measure the distance between point sets defining the objects. The HD between A and G, h(A,G), is the maximum distance from any point in A to its nearest point in G and is defined as
$$h(A,G) = \max_{a \in A} \min_{g \in G} d(a,g)$$
where $d(a,g)$ is the distance between points $a$ and $g$.
The symmetric Hausdorff distance is given as:
$$H(A,G) = \max\big(h(A,G),\, h(G,A)\big)$$
The Hausdorff distance, although commonly used, is highly susceptible to outliers resulting from noisy data. However, many more robust variations of this measure, in which the outliers are discarded, have been used in our application. We have implemented all of the above-described measures in our system.
• STAPLE: Warfield [12] proposed Simultaneous Truth and Performance Level Estimation (STAPLE), an expectation-maximization algorithm that computes a probabilistic estimate of the true segmentation given a set of manual contours. In addition to the expert-derived contours, we have created an additional set of “ground truth” contours using this algorithm.
Software and Website
The website was built using the Ruby on Rails framework2. The ruby-dicom gem3 was used for parsing the DICOM files. A PostgreSQL4 database was used to store the user information, information about the studies including the location of the CT slices, information extracted from the DICOM header including names, volumes, and slice information for the structures, and the metrics derived from all users. The main processing of the structures and calculation of the metrics was performed in C++ using the ITK toolkit5. These procedures were wrapped in Ruby and called from the website to generate the report. The flow of data and user interaction for the system is given below.
2 http://rubyonrails.org/
3 http://dicom.rubyforge.org/
4 http://www.postgresql.org/
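As an illustration of the kind of metric calculation described above, the following is a minimal Python sketch and not the TaCTICS implementation, which was written in C++ with ITK. It assumes each contour has already been rasterized to a set of voxel index tuples, and that boundaries are available as non-empty lists of points. All function names are hypothetical, and the normalization of the false positive and false negative Dice by |A| + |G| is an assumption, since the text does not state the formula used.

```python
from math import dist


def _total(a, g):
    """Return |A| + |G|, the shared denominator of the Dice-style measures."""
    return len(a) + len(g)


def dice(a, g):
    """Dice coefficient 2|A & G| / (|A| + |G|) for two sets of voxel indices."""
    total = _total(a, g)
    return 2 * len(a & g) / total if total else 1.0


def jaccard(a, g):
    """Jaccard (Tanimoto) coefficient |A & G| / |A | G|."""
    union = a | g
    return len(a & g) / len(union) if union else 1.0


def false_positive_dice(a, g):
    """Voxels labeled by the user (a) but not the expert (g).

    Assumed normalization: 2|A - G| / (|A| + |G|).
    """
    total = _total(a, g)
    return 2 * len(a - g) / total if total else 0.0


def false_negative_dice(a, g):
    """Voxels labeled by the expert (g) but missed by the user (a).

    Assumed normalization: 2|G - A| / (|A| + |G|).
    """
    total = _total(a, g)
    return 2 * len(g - a) / total if total else 0.0


def directed_hausdorff(points_a, points_g):
    """h(A,G): largest distance from a point in A to its nearest point in G.

    Both point lists must be non-empty.
    """
    return max(min(dist(p, q) for q in points_g) for p in points_a)


def hausdorff(points_a, points_g):
    """Symmetric Hausdorff distance max(h(A,G), h(G,A))."""
    return max(directed_hausdorff(points_a, points_g),
               directed_hausdorff(points_g, points_a))


if __name__ == "__main__":
    # Toy 2D example: two 2x2 squares that overlap in one column.
    user = {(0, 0), (0, 1), (1, 0), (1, 1)}
    expert = {(0, 1), (1, 1), (0, 2), (1, 2)}
    print("Dice:", dice(user, expert))
    print("Jaccard:", jaccard(user, expert))
    print("FPD:", false_positive_dice(user, expert))
    print("FND:", false_negative_dice(user, expert))
    print("Hausdorff:", hausdorff(sorted(user), sorted(expert)))
```

The brute-force Hausdorff computation compares every pair of points, which is adequate for a sketch but would likely need a spatial index or a distance transform for full clinical contours.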
Figure 2 System Design
After a user logs in to the system, they can download the desired CT slices without the contours, contour the structures using their usual treatment planning system, and upload the resulting DICOM RTSTRUCT file. They can then select an expert to be used as the reference. Alternatively, they can compare their structures to those created using STAPLE, as described in the previous section. The users are then e-mailed a report containing all the chosen metrics, histograms of all the corresponding metrics available in the system with their values highlighted, as well as thumbnails of CT slices with their contours and those of the expert overlaid. An example of the histogram of the Dice coefficients for the prostate dataset for user #1 is shown below. Users can identify their place on a histogram of all users (red highlighted region). Users can then, by their relative histogram position, judge visually as well as numerically their agreement with an expert-derived reference. They can also perceive how they performed compared to other users of the system or compared to their metrics from previous attempts.
Figure 3 Histogram of metrics with the current user's values highlighted
Discussions
We have designed and implemented a contour evaluation software platform for use in radiation oncology. Our target audience for this phase was residents and other non-expert users who contour tumors and OARs as part of the process of creating a radiation therapy plan. In order to obtain sufficient data to extract meaningful comparisons between users, we have collected over 500 different structures for five anatomical locations. Despite this great interest, comparatively little data has to date been presented regarding strategic optimization of target delineation itself, either as a function of standardized practices or as a function of deliberate educational curriculum. Using previously collected pilot data, we have determined that there is substantial inter-observer variability in terms of target volume reproducibility. We hope that by constructing a GUI that allows users to analyze target volume ROIs and gain meaningful "scores" regarding their performance, we will achieve more consistent target volume delineation.
Conclusions and Future Plans
We believe that the TaCTICS tool may be useful in the educational context, as well as in potentially reducing variability among users and sites during multi-institutional clinical trials. We have submitted a grant for a prospective study that aims to evaluate the effectiveness of TaCTICS in reducing inter-rater differences in target volume delineation and increasing the conformance of user contours to those of clinical experts. At present, our sample size of expert observers is limited. Our current dataset used 2 experts per anatomic subsite; we hope to expand this number substantially in order to improve the quality of the reference data.
5 http://www.itk.org/
We will continue to add anatomical locations and additional patient studies as we acquire more data. Additionally, as users participate in utilizing this tool, we will expand our set of contours, enabling us to provide statistical data with increased sample sizes. Also, prospective studies allowing model/software validation as a training and feedback tool will assist in optimizing user interface features [13]. These planned studies will formally evaluate usability, utility, and ease of implementation in an educational setting. Finally, we plan to create treatment plans using the more extreme contours to better understand the impact of the inter-rater variability in the contours on the final dose profiles in the tumors as well as nearby organs.
Acknowledgments
JKC was supported by a K99-R00 grant from the National Library of Medicine (1K99LM009889-01A1). C.D.F. was supported by a T32 Training Grant from the National Institutes of Health/National Institute of Biomedical Imaging and Bioengineering (“Multidisciplinary Training Program in Human Imaging”, 5T32EB000817-04), a Technology Transfer Grant from the European Society for Therapeutic Radiology and Oncology, and the Product Support Development Grant from the Society for Imaging Informatics in Medicine. The funder(s) played no role in study design, in the collection, analysis and interpretation of data, in the writing of the manuscript, nor in the decision to submit the manuscript.
References
1. Foppiano F, Fiorino C, Frezza G, et al. The impact of contouring uncertainty on rectal 3D dose-volume data: results of a dummy run in a multicenter trial (AIROPROS01-02). Int J Radiat Oncol Biol Phys 2003; 57:573-579.
2. Jeanneret-Sozzi W, Moeckli R, Valley J, et al. The Reasons for Discrepancies in Target Volume Delineation. Strahlenther Onkol 2006; 182:450-457.
3. Njeh CF. Tumor delineation: The weakest link in the search for accuracy in radiotherapy. J Med Phys 2008; 33:136-140.
4. Rasch C, Keus R, Pameijer FA, et al. The potential impact of CT-MRI matching on tumor volume delineation in advanced head and neck cancer. Int J Radiat Oncol Biol Phys 1997; 39:841-848.
5. Weiss E, Hess C. The impact of gross tumor volume (GTV) and clinical target volume (CTV) definition on the total accuracy in radiotherapy theoretical aspects and practical experiences. Strahlenther Onkol 2003; 179:21-30.
6. Jeanneret-Sozzi W, Moeckli R, et al. The Reasons for Discrepancies in Target Volume Delineation. Strahlenther Onkol 2006; 182(8):450-457.
7. Steenbakkers R, Duppen J, et al. Observer variation in target volume delineation of lung cancer related to radiation oncologist–computer interaction: A ‘Big Brother’ evaluation. Radiother Oncol 2005; 77(2):182-190.
8. Bekelman JE, Wolden S, et al. Head-and-neck target delineation among radiation oncology residents after a teaching intervention: a prospective, blinded pilot study. Int J Radiat Oncol Biol Phys 2009; 73(2):416-423.
9. Fuller CD, Duppen J, Rasch CR, et al. A Prospective Randomized Pilot Study of Site-specific Atlas Incorporation into Target Volume Delineation Instructions in the Cooperative Group Setting: Preliminary Results from a Southwest Oncology Group Pilot using Big Brother. Int J Radiat Oncol Biol Phys 2009; 75:S136-S137.
10. Fuller CD, Nijkamp J, Duppen J, et al. Prospective randomized double-blind pilot study of site-specific consensus atlas implementation for rectal cancer target volume delineation in the cooperative group setting. Int J Radiat Oncol Biol Phys 2009.
11. Babalola KO, Patenaude B, Aljabar P, Schnabel J, Kennedy D, Crum W, Smith S, Cootes T, Jenkinson M, Rueckert D. An evaluation of four automatic methods of segmenting the subcortical structures in the brain. NeuroImage 2009; 47(4):1435-1447.
12. Warfield S, Zou K, Wells W. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging 2004; 23(7):903-921.
13. Skeate RC, Wahi MM, et al. Personal digital assistant-enabled report content knowledgebase results in more complete pathology reports and enhances resident learning. Hum Pathol 2007; 38(12):1727-1735.