Database Design and Implementation for Quantitative Image Analysis Research

Matthew S. Brown, Sumit K. Shah, Richard C. Pais, Yeng-Zhong Lee, Student Member, IEEE, Michael F. McNitt-Gray, Member, IEEE, Jonathan G. Goldin, Alfonso F. Cardenas, and Denise R. Aberle
Abstract—Quantitative image analysis (QIA) goes beyond subjective visual assessment to provide computer measurements of the image content, typically following image segmentation to identify anatomical regions of interest (ROIs). Commercially available picture archiving and communication systems focus on storage of image data. They are not well suited to efficient storage and mining of new types of quantitative data. In this paper, we present a system that integrates image segmentation, quantitation, and characterization with database and data mining facilities. The paper includes generic process and data models for QIA in medicine and describes their practical use. The data model is based upon the Digital Imaging and Communications in Medicine (DICOM) data hierarchy, which is augmented with tables to store segmentation results (ROIs) and quantitative data from multiple experiments. Data mining for statistical analysis of the quantitative data is described along with example queries. The database is implemented in PostgreSQL on a UNIX server. Database requirements and capabilities are illustrated through two quantitative imaging experiments related to lung cancer screening and assessment of emphysema lung disease. The system can manage the large amounts of quantitative data necessary for research, development, and deployment of computer-aided diagnosis tools.

Index Terms—Data models, database systems, image analysis.
I. INTRODUCTION
IMAGING IS taking on a more prominent role in the diagnosis and quantitative assessment of disease. Quantitative image analysis (QIA) goes beyond subjective visual assessment to provide computer measurements of the image content, typically following image segmentation to identify anatomical regions of interest (ROIs). Image processing techniques allow a large and complex set of quantitative measures to be derived from images, particularly in a research setting. Commercially available picture archiving and communication systems (PACS) focus on storage of image data. They are not designed for efficient storage and mining of new types of quantitative data. This need is the focus of the development to be described in this paper.

Examples of quantitative imaging can be found in lung cancer screening and assessment of emphysema lung disease.
Manuscript received June 10, 2003; revised July 9, 2004. M. S. Brown, S. K. Shah, R. C. Pais, M. F. McNitt-Gray, J. G. Goldin, and D. R. Aberle are with the David Geffen School of Medicine, Department of Radiological Sciences, UCLA, Los Angeles, CA 90095 USA (e-mail:
[email protected]). Y.-Z. Lee and A. F. Cardenas are with the Henry Samueli School of Engineering and Applied Science, Computer Science Department, University of California at Los Angeles, Los Angeles, CA 90095 USA. Digital Object Identifier 10.1109/TITB.2004.837854
Lung cancer screening involves imaging of high-risk patients, such as long-time smokers, to search for lung nodules (tumors) that may indicate the presence of lung cancer [1], [2]. Emphysema is another lung disease often associated with smoking, and imaging can be used to quantitate the extent and severity of the lung destruction [3], [4]. We will use these examples to illustrate general database requirements for quantitative imaging in medicine.

Advances in image acquisition technology, computer vision systems, and new clinical/research questions are leading to increasing amounts of quantitative data being derived from medical images. Imaging modalities such as computed tomography (CT) and magnetic resonance are generating large volumes of image data. Lung cancer screening is performed with CT imaging. Until relatively recently, a single-slice helical CT through the lungs consisted of about 40 images spaced every 10 mm. New multislice CT imaging technology, however, enables higher-resolution imaging of the body, with 400 images spaced every 1 mm through the chest. This offers the potential for detecting much smaller, early stage lung nodules, but there are many more images to read, placing a great burden on radiologists in terms of reading time and fatigue and creating the potential for missed tumors. This increase in the amount of image data to be interpreted is a common problem throughout medical imaging, and therefore there is great research and commercial interest in computer systems to process the data. For example, preliminary studies have shown that image-analysis software that automatically searches CT images for lung nodules can increase a radiologist's sensitivity in detecting small lung nodules at an early stage, when they may be more effectively treated [5], [6]. Systems can then measure nodule characteristics to assist in the diagnostic process [7], [8]. Quantitative image-analysis systems such as these are based upon computer vision techniques such as segmentation and feature extraction, and can assist with detection, quantitation, and visualization tasks. Image-analysis systems are making it possible to extract large amounts of measurement data from medical images and are raising the need for data management and mining capabilities for large volumes of quantitative data.

Given the advances in imaging modalities and image-analysis technology, radiologists are able to ask new clinical questions involving quantitative image data. For example, emphysema is being quantitated (using volume of gas per gram of lung tissue [4]) and changes in disease severity are being measured during evaluation of novel therapies [9]. Large outcome trials are required to evaluate a new therapy, and multicenter clinical trials are becoming more common to provide the required number of
cases and unbiased populations. This contributes to the amount and heterogeneity of quantitative data and to the requirements for data mining capabilities. A number of institutions are developing quantitative imaging tools, but very few are involved in large clinical trials, and they often lack the necessary database infrastructure to accommodate quantitative imaging on such a large scale. If a quantitative imaging research group is involved in multiple clinical trials, it will typically use a separate database for each project, which makes it difficult to pool and mine data across projects to answer new or broader research questions.

Computer-aided diagnosis (CAD) systems are also under development that use quantitative data extracted from images to reach a diagnosis that can be presented to a radiologist. For example, after nodules are detected in a lung cancer screening exam, CAD systems under development use image features to characterize the nodule as benign or malignant (cancerous) [7], [8]. To develop a CAD system, a classifier must be trained to automatically recognize patterns of features that indicate a particular abnormality or disease. This classifier training involves collection of a large number of image data sets and then extraction of a large number of features from each data set. The vast amount of quantitative data must then be stored and mined to select the relevant features and then train and test the classifier. Collaborations such as the Lung Imaging Database Consortium, sponsored by the National Institutes of Health (NIH) [10], are underway to collect the number of data sets required for lung cancer CAD development, but a major hurdle remains in the database infrastructure to store and mine the derived quantitative data. With few exceptions, research developments in CAD systems have been limited to small numbers of cases and features (tens or occasionally hundreds of cases). This lack of cases means that CAD systems have been difficult to generalize, and very few have been commercialized and reached routine use in clinical practice. To accommodate the thousands of cases required for robust CAD development, database capabilities for quantitative imaging will have to be improved. To date, research groups involved in CAD have focused on the development of image processing technology rather than data management.

Research has been performed in PACS and radiology information systems (RIS) [11], [12], and such systems are commercially available; however, they do not meet the specific needs of QIA research. Current RIS and PACS systems typically do not provide integrated access to image and report data. A commercial RIS stores patient demographic and report data. If any quantitative data is generated from an imaging exam, it is usually embedded within the text report rather than in a structured database field. The development of a database structure for quantitative data is significant because it allows querying and mining of the quantitative data. This flexibility in querying is not currently supported within a RIS.

The overall goal of this work is to develop an integrated environment to facilitate research and application of quantitative imaging techniques. This should be a single system that integrates image segmentation, quantitation, and characterization with database and data mining facilities, specifically for quantitative data. We will present generic process and data models for QIA in medicine and describe their practical use.
The system should handle large amounts of data for development and validation of CAD tools, and for deployment in multicenter clinical trials. This infrastructure has not been available and has been a limiting factor in bringing CAD tools to clinical practice. This need is driving the database research and development presented in this paper.

II. PROCESS MODEL: APPLICATION CHARACTERISTICS AND SYSTEM DESIGN REQUIREMENTS

The image analysis within this project is currently centered around thoracic CT imaging. It involves automated image segmentation using a model-based approach to extract regions of interest for quantitative analysis: volume measurement, attenuation measures, and texture analysis [13], [14]. A variety of CAD applications are being developed around these quantitative techniques, e.g., nodule detection and characterization, and characterization of early lung disease [5], [7], [9], [15]. However, the techniques and infrastructure are being developed to generalize beyond the thorax and beyond CT. The model-based image analysis software was designed to be generic and has been applied to other areas [16]. The data model to be presented is also designed to be extensible to multiple QIA tasks and research projects.

In an imaging research setting, there are typically many variables under investigation, for example, variables in lung CT image acquisition: collimation, tube current, reconstruction algorithm, breathing state, etc. For each imaging protocol, there are also many different quantitative features being extracted in the search for the optimal combination of imaging parameters and features to characterize the disease process or answer the clinical question at hand. This requires a complex data model and complex queries. Once the meaningful variables are selected for use by a CAD system to perform a particular diagnostic task, the queries become less complex, since only those variables need be retrieved. Therefore, we would expect the data model in a clinical system (perhaps incorporated into an RIS or PACS) to be a simpler subset of that used for research. The database described here, however, is designed for the more complex research environment.

The database must also accommodate the temporal nature of quantitative imaging. For example, as a patient is imaged repeatedly to monitor therapy, quantitative data at different time points must be stored and queried.

The architecture must support multiple "experiments" in a research environment. For example, the emphysema and lung cancer projects are separate quantitative imaging experiments. Accommodating multiple experiments within a single infrastructure requires an open, flexible database design. Furthermore, a single unified database for all experiments should enable data mining across experiments, i.e., on all available data. Many separate clinical trials and research projects are underway, and pooling of the data from these studies has great potential for answering broader clinical questions.

The database must accommodate a large amount of quantitative data. Each experiment will contain many subjects, who will have multiple image series. Each series will be segmented into multiple ROIs, from which multiple measures will be derived. This multiplicity results in the large number of quantitative data points to be acquired and stored.

Fig. 1 gives an overview of the process model for QIA.
Fig. 1. Process model.

The following subsections describe the steps in the process model, their database requirements, and any user intervention required. These steps include image acquisition, review of images, image segmentation, quantitative analysis, reporting, quality assurance, and data mining.

A. Image Data Acquisition

Digital images are acquired from a CT scanner. They are then transferred to an image server within the quantitative analysis infrastructure, and the relevant fields pertaining to the patient and CT technical factors are populated in the database. The image data is too large to be stored directly in the database (approximately 50 MB for a single CT series of 100 images), so the database stores a path to the image data files on the server. Patient confidentiality is protected in clinical trials, so the patient name and other identifiers are replaced with a study identifier in the image header prior to transfer. Thus, patients are "anonymized" in the database (a sketch of this registration step is given after Section II-B).

A high volume of data is acquired, both in terms of the number of patients and the number of images per patient. For example, in the multicenter National Lung Screening Trial (NLST), the target enrollment from 30 participating sites is 50 000 subjects. One half of these (25 000) will be randomized to CT and will each have 600–900 images (300–450 MB) acquired over the course of the trial. This will ultimately result in large volumes of derived quantitative data from multiple experiments that share this common pool of image data.

B. Review of Incoming Series

A research associate reviews each incoming imaging study. Proper image acquisition and image quality are checked. Each imaging study may consist of multiple series with different CT technical factors, and the associate makes sure all series were acquired correctly. All images from multiple experiments are transferred to the same server and must be organized within the unified database. The associate identifies the experiment(s) in which a patient is enrolled and assigns the relevant series to those experiments. For example, a volumetric series through the entire lung with thin collimation may be assigned to a nodule detection experiment. This assignment is performed by a human operator, who verifies that the series was acquired with an appropriate protocol. Assignment to an experiment effectively specifies which QIA methods should be applied to the series.
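To make the registration step in Section II-A concrete, the sketch below stores one anonymized image's header metadata and file location over JDBC. It is a minimal illustration only: the table and column names, the header keys, and the connection details are assumed for the example and are not taken from the system described here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

/** Registers one anonymized DICOM image file in the research database (illustrative names). */
public class ImageRegistration {

    /**
     * header holds already-parsed DICOM fields; path is the file location on the
     * image server. Only metadata and the path are stored; pixel data stays on disk.
     */
    public static void register(Connection db, Map<String, String> header, String path)
            throws Exception {
        // Anonymization: the patient name/ID are replaced by the study-assigned
        // identifier before anything is written to the database.
        String studyAssignedId = header.get("StudyAssignedID");
        header.put("PatientName", studyAssignedId);
        header.put("PatientID", studyAssignedId);

        String sql = "INSERT INTO image (sop_instance_uid, series_instance_uid, "
                   + "slice_location, file_path) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, header.get("SOPInstanceUID"));
            ps.setString(2, header.get("SeriesInstanceUID")); // foreign key to the Series row
            ps.setDouble(3, Double.parseDouble(header.get("SliceLocation")));
            ps.setString(4, path); // path on the image server, not the pixel data
            ps.executeUpdate();
        }
        // Parent Patient, Study, and Series rows would be inserted (or looked up)
        // in the same way before the image rows reference them.
    }

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://dbserver/qia", "qia_user", "secret")) {
            // The header map would normally come from a DICOM parser.
            register(db, new java.util.HashMap<>(Map.of(
                    "StudyAssignedID", "NLST-000123",
                    "SOPInstanceUID", "1.2.840.0.1",
                    "SeriesInstanceUID", "1.2.840.0.2",
                    "SliceLocation", "-125.0")), "/data/ct/NLST-000123/img0001.dcm");
        }
    }
}
```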
C. Segmentation

A feature of the quantitative imaging environment is a computer vision system that automatically identifies and segments anatomic ROIs in CT images [16]–[18]. The automated system performs segmentation of specific regions from specific series for specific experiments (e.g., lung regions or lung tumors). It uses an anatomical model to guide segmentation and object recognition. The model used in the applications described here provides a parametric description of thoracic anatomy including the lungs, central airways, and lung lobes (although the framework allows for parametric modeling of other anatomy). The expected size, shape, topology, and X-ray attenuation of anatomical structures are stored as features in the model. These features are used to guide three-dimensional (3-D) segmentation and object recognition, which is accomplished by matching objects in the image to anatomical objects in the model. Features extracted from anatomical objects in the model and from segmented image regions are compared by an inference engine during the matching phase [19]. The inference engine uses fuzzy logic to match (label) the segmented regions [20]. Anatomically labeled ROIs are stored in the database (as described later in Section III-D).

The segmentation results may be edited as necessary. For example, if the lung segmentation incorrectly labels the airspace in the trachea as part of the lung fields, then it needs to be manually edited out. Since the editing process may introduce interobserver variability in the derived quantitative measures, the database must store both the fully automated segmentation results and the edited segmentations from multiple readers.

D. QIA

Following segmentation, quantitative data are derived from particular segmented regions, for example, the volume of the lung or its mean X-ray attenuation. These quantitative measures must be stored in the database and associated with the appropriate region (segmentation result), e.g., the mean attenuation of the left lung. The database must accommodate many different quantitative measures, since there are a large number of possible image features to select from when building classifiers for different diseases. For example, numerous texture measures are available for classification of lung diseases [21]–[23].

The quantitative analysis is performed automatically by querying the database to retrieve the relevant region of interest based on its anatomic label, performing the necessary computations, and then writing the quantitative data to the appropriate tables (to be described in Section III-D). The choice of the anatomic regions to be analyzed and the algorithms to be applied is specific to an experiment (CAD application). In Fig. 1, the image analysis is shown as part of the workflow for a patient. However, since the quantitative analyses are fully automated, they may also be performed offline or retrospectively.
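The following sketch illustrates the fully automated QIA step just described: it retrieves a labeled ROI from the Segmentation_Result table, derives a volume, and writes the result to a quantitative table. The run-length text format, the anatomic label value, and the column names are illustrative assumptions rather than the system's actual conventions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** One QIA step: look up a labeled ROI, derive a measure, and store it (assumed schema). */
public class LungVolumeAnalysis {

    /** Counts voxels in a run-length encoded ROI of the assumed form "start:length;start:length;...". */
    static long voxelCount(String runLengthRoi) {
        long voxels = 0;
        for (String run : runLengthRoi.split(";")) {
            if (run.isEmpty()) continue;
            voxels += Long.parseLong(run.split(":")[1]);
        }
        return voxels;
    }

    public static void analyze(Connection db, String seriesUid, double voxelVolumeMm3)
            throws Exception {
        String select = "SELECT id, roi FROM segmentation_result "
                      + "WHERE series_instance_uid = ? AND anatomic_label = ?";
        String insert = "INSERT INTO lung_volume (segmentation_result_id, volume_ml) VALUES (?, ?)";
        try (PreparedStatement sel = db.prepareStatement(select);
             PreparedStatement ins = db.prepareStatement(insert)) {
            sel.setString(1, seriesUid);
            sel.setString(2, "lungs"); // assumed anatomic label value
            try (ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    // Volume = voxel count x voxel volume; 1000 mm^3 = 1 ml.
                    double volumeMl = voxelCount(rs.getString("roi")) * voxelVolumeMm3 / 1000.0;
                    ins.setLong(1, rs.getLong("id")); // key the measure to its segmentation result
                    ins.setDouble(2, volumeMl);
                    ins.executeUpdate();
                }
            }
        }
    }
}
```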
Fig. 2. Overview of data model.
E. Reporting

Typically, a report will be generated for a patient immediately following QIA. This report will usually include a comparison with measurements from any available prior imaging studies. This prior information is queried from the database so that disease progression or regression can be reported. The report may be used in decisions about therapeutic interventions or about ordering additional diagnostic tests.

F. Quality Assurance Checks

The quantitative environment must enable handling of noisy and missing data, particularly problems with incorrect or inconsistent image header information. For example, a single patient participating in multiple clinical trials (experiments) may be assigned a different identifier for each trial. If these identifiers are used to populate the "patient id" field of the image header, it is possible that a single patient will have different identifiers for different imaging studies. Noisy data is much harder to control in multicenter studies. Therefore, mechanisms for consistency checks and corrections must be put in place.

Quality assurance checks must also confirm that the imaging protocol was consistent before temporal analysis of derived quantitative data is performed for a patient imaged on multiple visits. The imaging protocol must be identical across all visits if the derived quantitative data is to be reliably compared. To check the correctness of manual user interactions in the analysis process, double data analysis is typically performed on a subset of cases.
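A consistency check of the kind described above can be expressed as a single query. The sketch below flags patients whose series were acquired with differing collimation or reconstruction kernels across visits; the table and column names are assumptions made only for this example.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/** Quality-assurance check: flag patients whose repeat scans used different protocols. */
public class ProtocolConsistencyCheck {
    public static void report(Connection db) throws Exception {
        // A patient can only be followed longitudinally if the acquisition protocol
        // (here, collimation and reconstruction kernel) is the same at every visit.
        String sql =
            "SELECT p.patient_id, COUNT(DISTINCT se.collimation) AS n_collimations, " +
            "       COUNT(DISTINCT se.recon_kernel) AS n_kernels " +
            "FROM patient p " +
            "JOIN study  st ON st.patient_id = p.patient_id " +
            "JOIN series se ON se.study_instance_uid = st.study_instance_uid " +
            "GROUP BY p.patient_id " +
            "HAVING COUNT(DISTINCT se.collimation) > 1 OR COUNT(DISTINCT se.recon_kernel) > 1";
        try (Statement stmt = db.createStatement(); ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("Protocol mismatch for patient %s (%d collimations, %d kernels)%n",
                        rs.getString("patient_id"), rs.getInt("n_collimations"), rs.getInt("n_kernels"));
            }
        }
    }
}
```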
G. Data Mining and Statistical Analysis

Direct connections from quantitative analysis and statistical analysis workstations to the database are vital. Intermediate files should not be used to pass data between the applications and the database. Given the large volumes of data and the complexity of imaging and analysis protocols, there is a high risk of mislabeling intermediate data files and, thus, incorrectly importing them into statistical analysis packages. A direct connection between the image display/analysis software and the database is also important for rapid retrieval of quantitative data during a clinical interpretation, e.g., lung nodule volumes from a prior study when change in nodule size is being measured. Data mining must be possible across experiments to allow pooling of data derived from different patient populations to answer broader medical research questions. This is a major motivating factor for designing a single database that can accommodate multiple experiments.

The data model and database are designed to meet this unique set of requirements for quantitative imaging in a medical research and development setting.

III. DATA MODEL

The data model is hierarchical, as shown in the high-level overview in Fig. 2. Near the root of the hierarchy, tables have been designed that are common across experiments, and this commonality is preserved as far down the hierarchy as possible. This is important for extension of the data model and for data mining across experiments. We now review the major components of the data model.

A. Experiment

An experiment indicates a particular research project or question, with specific derived quantitative measures, involving a particular cohort of patients. Each experiment has a specific set of quantitative tools associated with it, and the image analysis user interface is automatically customized to present only the relevant tools. Each experiment is assigned an ID number (primary key) in the Experiment table, so that rows in other tables can be associated with it. For example, rows in the Series table are associated with an experiment using this ID, which effectively assigns patients to experiments. The association permits patients to be assigned to multiple experiments.
Fig. 3. Patient browser component of the user interface.
B. Patient/Image Data

The Digital Imaging and Communications in Medicine (DICOM) data hierarchy1 is used for storing patient and image acquisition information. The hierarchy has the patient at its root, which includes name, ID, date of birth, etc. A patient has multiple imaging studies, where a study is typically a single imaging exam during a single patient visit. Each study consists of multiple CT scans (series), and a series is made up of multiple cross-sectional images. A single CT image is approximately 0.5 MB, so a typical series of 100 images occupies 50 MB, making the images too large to store directly in the database. Therefore, each row in the Image table stores the location (path) of a DICOM image file, which is typically on a dedicated server but could be distributed. The attributes of these tables are the fields specified in the DICOM standard.

1 Digital Imaging and Communications in Medicine Web Page, http://medical.nema.org/dicom/2003.html.

C. Users

Users of the system are defined in a User table and are each assigned a unique ID (primary key). They are granted access to experiment data via an association between user IDs and experiment IDs. A Preferences table is used to specify graphical user interface layout preferences.
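A simplified schema covering the tables discussed in Sections III-A to III-C might look as follows. The column names and types are illustrative assumptions (the actual data model holds the full set of DICOM attributes across 20 common tables), and the User table is named qia_user here only because user is a reserved word in PostgreSQL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Creates a simplified, illustrative subset of the core tables. */
public class CoreSchema {
    private static final String[] DDL = {
        "CREATE TABLE experiment (experiment_id SERIAL PRIMARY KEY, name TEXT NOT NULL)",
        "CREATE TABLE qia_user   (user_id SERIAL PRIMARY KEY, login TEXT UNIQUE NOT NULL)",
        // Association table granting users access to experiments.
        "CREATE TABLE exper_user (experiment_id INT REFERENCES experiment, " +
        "                         user_id INT REFERENCES qia_user, " +
        "                         PRIMARY KEY (experiment_id, user_id))",
        // DICOM hierarchy: patient -> study -> series -> image.
        "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, birth_date DATE)",
        "CREATE TABLE study   (study_instance_uid TEXT PRIMARY KEY, " +
        "                      patient_id TEXT REFERENCES patient, study_date DATE)",
        "CREATE TABLE series  (series_instance_uid TEXT PRIMARY KEY, " +
        "                      study_instance_uid TEXT REFERENCES study, " +
        "                      collimation REAL, recon_kernel TEXT)",
        // Images store a path to the DICOM file, not the pixel data itself.
        "CREATE TABLE image   (sop_instance_uid TEXT PRIMARY KEY, " +
        "                      series_instance_uid TEXT REFERENCES series, " +
        "                      slice_location REAL, file_path TEXT NOT NULL)",
        // Series are assigned to experiments, which assigns patients indirectly.
        "CREATE TABLE exper_series (experiment_id INT REFERENCES experiment, " +
        "                           series_instance_uid TEXT REFERENCES series, " +
        "                           PRIMARY KEY (experiment_id, series_instance_uid))"
    };

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://dbserver/qia", "qia_user", "secret");
             Statement stmt = db.createStatement()) {
            for (String ddl : DDL) stmt.executeUpdate(ddl);
        }
    }
}
```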
D. Quantitative Data

Tables above the image level are common to all experiments. Experiments tend to diverge below this level because of the various types of quantitative data. However, tables are shared whenever possible so that data mining can be done across experiments. One such example is segmentation results. Automated segmentation is driven by parametric models of the anatomy of interest [17], [18]. The Segmentation_Model table stores the different models, and the Segmentation_Result table is a child of the Series (image data) and Segmentation_Model tables. Segmentation results are associated at the series level since most ROIs are 3-D. ROIs are represented by a run-length coding [24] and stored as text fields in the Segmentation_Result table along with an anatomic label.

Fig. 2 gives an overview of the quantitative tables from two experiments; we will describe these experiments and their tables in more detail in Section V. Fig. 2 shows a simplified overview of the data model. The actual data model contains 20 common tables for patient, image, and segmentation data. For the two experiments described here, lung cancer and emphysema, there are 11 quantitative tables relating specifically to these experiments. In these 31 tables, there are a total of 231 attributes.
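ROIs are stored as run-length coded text [24], but the exact text format is not specified here, so the sketch below assumes one possible scheme: a flattened binary mask encoded as start:length runs, matching the decoder assumed in the Section II-D sketch.

```java
/** Encodes a binary ROI mask (flattened in scan-line order) as "start:length" runs. */
public class RunLengthRoi {

    public static String encode(boolean[] mask) {
        StringBuilder runs = new StringBuilder();
        int i = 0;
        while (i < mask.length) {
            if (mask[i]) {
                int start = i;
                while (i < mask.length && mask[i]) i++;   // extend the run of set voxels
                if (runs.length() > 0) runs.append(';');
                runs.append(start).append(':').append(i - start);
            } else {
                i++;
            }
        }
        return runs.toString(); // suitable for a text column in Segmentation_Result
    }

    public static void main(String[] args) {
        boolean[] mask = {false, true, true, true, false, false, true, true};
        System.out.println(encode(mask)); // prints "1:3;6:2"
    }
}
```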
IV. QUERIES

Different types of users will typically operate on different stages of the process model. Some will query, some will update, but many queries are common or similar. The following subsections describe key access methods and associated queries.

A. Patient Browser

The most common user interface is the patient browser. It is the view through which experiment, patient, and imaging study information is obtained. It is a composition of data from five tables: Experiment, Patient, Studies, Series, and Image. Fig. 3 shows a snapshot of the patient browser. The lists displayed in the patient browser follow the DICOM hierarchy, with the data displayed in each list based on the selection made by the user from the parent list in the hierarchy. The list of experiments that are accessible to the given user is formed with Query 1.
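One plausible JDBC rendering of Query 1 is sketched below, assuming an exper_user association table with experiment_id and user_id columns; it is not necessarily the exact query used by the system.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

/** Query 1 (illustrative form): experiments visible to the logged-in user. */
public class Query1 {
    public static List<Integer> accessibleExperiments(Connection db, int userId) throws Exception {
        String sql = "SELECT e.experiment_id " +
                     "FROM experiment e " +
                     "JOIN exper_user eu ON eu.experiment_id = e.experiment_id " +
                     "WHERE eu.user_id = ?";
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setInt(1, userId); // the user currently logged in
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getInt("experiment_id"));
            }
        }
        return ids;
    }
}
```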
The Exper_User table associates experiments and users, thereby granting users access to experiments. The user_id value used in Query 1 depends on the user currently logged in. Query 2 selects all patients, studies, and series belonging to a given experiment.
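Query 2 can be expressed as a join down the DICOM hierarchy restricted by the Exper_Series association. The following is an illustrative form only, with assumed column names.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Query 2 (illustrative form): the patients, studies, and series in one experiment. */
public class Query2 {
    public static void listExperimentContents(Connection db, int experimentId) throws Exception {
        String sql = "SELECT p.patient_id, st.study_instance_uid, se.series_instance_uid " +
                     "FROM exper_series es " +
                     "JOIN series  se ON se.series_instance_uid = es.series_instance_uid " +
                     "JOIN study   st ON st.study_instance_uid  = se.study_instance_uid " +
                     "JOIN patient p  ON p.patient_id           = st.patient_id " +
                     "WHERE es.experiment_id = ? " +
                     "ORDER BY p.patient_id, st.study_instance_uid, se.series_instance_uid";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setInt(1, experimentId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s / %s / %s%n",
                            rs.getString(1), rs.getString(2), rs.getString(3));
                }
            }
        }
    }
}
```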
The attributes study_instance_uid and series_instance_uid are primary keys for the Study and Series tables, respectively, and are provided as part of the DICOM standard. The Exper_Series table associates series with experiments.

B. Image Series Manipulation/Segmentation

CT provides volumetric image data, so image display and processing is often performed in three dimensions. Therefore, all images in the series are typically read into a single in-memory data structure called MedicalImageSequence. To form this data structure, the system must read all pixel values from all images in the series. This involves selecting all rows from the Image table that have the appropriate series_instance_uid as a foreign key. From these rows, the file names of all images that make up the series can be obtained. MedicalImageSequence is the gateway to the original image data for segmentation, QIA, and display. Before analysis, the image data must be segmented; this requires raw pixel data (gray levels) and voxel sizes, which are stored in the MedicalImageSequence object. The appropriate segmentation model for the given type of series must be selected from the Segmentation_Model table. Output from the automated segmentation is written to the Segmentation_Result table.

C. ROI and Quantitative Data

When performing QIA, the Segmentation_Result table is queried, based on the anatomic label, to obtain the ROI on which analysis is to be performed (the labeled ROI is extracted automatically as part of the image segmentation described in Sections II-C and IV-B). The required gray-level image data is again obtained from the MedicalImageSequence object. Then quantitative measures can be computed and written to the appropriate quantitative tables in the database. For example, for a given thoracic CT scan, the segmentation system can extract the lungs as a region of interest and their volume can be computed. This volume is written to the Lung_Volume table along with the foreign key of the associated row in the Segmentation_Result table. Once analysis is completed, the quantitative data can be queried, for example, to obtain the change in lung volume for a given patient over all of their visits (imaging studies). This requires joins on the DICOM hierarchy (Query 2) to obtain all series for the given patient, followed by Query 3.
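An illustrative form of Query 3 is sketched below, retrieving lung volumes for the series returned by Query 2 (passed in as an array) and ordering them by study date. The table and column names follow the assumptions made in the earlier sketches.

```java
import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Query 3 (illustrative form): lung volumes for the series returned by Query 2. */
public class Query3 {
    public static void lungVolumesOverVisits(Connection db, String[] seriesUids) throws Exception {
        String sql = "SELECT sr.series_instance_uid, st.study_date, lv.volume_ml " +
                     "FROM lung_volume lv " +
                     "JOIN segmentation_result sr ON sr.id = lv.segmentation_result_id " +
                     "JOIN series se ON se.series_instance_uid = sr.series_instance_uid " +
                     "JOIN study  st ON st.study_instance_uid  = se.study_instance_uid " +
                     "WHERE sr.series_instance_uid = ANY (?) " + // series supplied by Query 2
                     "ORDER BY st.study_date";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            Array uids = db.createArrayOf("text", seriesUids);
            ps.setArray(1, uids);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s  %s  %.1f ml%n",
                            rs.getString(1), rs.getDate(2), rs.getDouble(3));
                }
            }
        }
    }
}
```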
In Query 3, the series_instance_uid values are provided by Query 2.

D. Data Mining Across Experiments

These are global, nonhierarchical queries in the sense that they mine pooled data from all experiments. They enable new research questions to be asked based on the wealth of pooled image and quantitative data that will reside in the unified multipurpose database. For example, lung segmentation is performed on both lung cancer and emphysema subjects (among others). Therefore, to look at global trends and distributions of lung volumes, a query can be performed independent of experiment. The specific quantitative data derived may differ between experiments; for example, some experiments may choose not to compute lung volume. However, the segmentation tables are common, so a simple query can retrieve all lung segmentation results independent of experiment, and lung volumes can be computed when necessary so that they can be added to the pooled data.
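As an example of such a global query, the sketch below summarizes lung volumes pooled over every experiment simply by omitting any experiment restriction. The aggregates chosen and the column names are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/** Cross-experiment mining (illustrative): pool lung volumes from every experiment. */
public class PooledLungVolumes {
    public static void summarize(Connection db) throws Exception {
        // No experiment_id restriction: the segmentation and lung-volume tables are shared,
        // so this query sees data pooled across all experiments.
        String sql = "SELECT COUNT(*) AS n, AVG(volume_ml) AS mean_ml, " +
                     "       MIN(volume_ml) AS min_ml, MAX(volume_ml) AS max_ml " +
                     "FROM lung_volume";
        try (Statement stmt = db.createStatement(); ResultSet rs = stmt.executeQuery(sql)) {
            if (rs.next()) {
                System.out.printf("n=%d mean=%.0f ml range=[%.0f, %.0f] ml%n",
                        rs.getLong("n"), rs.getDouble("mean_ml"),
                        rs.getDouble("min_ml"), rs.getDouble("max_ml"));
            }
        }
    }
}
```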
Fig. 4. Lung nodule.

V. IMPLEMENTATION AND RESULTS

The database was implemented in PostgreSQL2 on a UNIX server. There are a number of different interfaces provided to the database for different steps in the process model. Image data acquisition, review of incoming series, and QIA are performed using an image viewing and analysis application written in Java. This application accesses the database using the Java Database Connectivity (JDBC) Application Programming Interface. Quality assurance checks and data mining are performed by querying the database via Microsoft Excel and Access using a Postgres Open Database Connectivity (ODBC) driver.3 A web interface through the PHP (PHP Hypertext Preprocessor) scripting language is also provided for some experiments and quantitative data. Statistical packages, such as R [25], can also query PostgreSQL using ODBC or direct PostgreSQL drivers.

2 The PostgreSQL Global Development Group. [Online]. Available: http://www.postgresql.org.
3 PostgreSQL database connectivity (ODBC) driver. [Online]. Available: http://gborg.postgresql.org/project/psqlodbc/projdisplay.php.

A. Experiment #1—Lung Nodule Detection

The overall goal of this work is the development of an automated computer system for detecting lung nodules (potential indicators of lung cancer) in CT exams. The computer-aided detection (CAD) system uses a parametric model of anatomy to recognize segmented lung nodules and to separate them from anatomic structures with a similar appearance, such as pulmonary vessels [5], [17]. The hypothesis of this particular experiment was that computer assistance improves a reader's accuracy in detecting lung nodules on low-dose thin-section CT. To test this hypothesis, changes in sensitivity (percent of nodules detected) are measured between reading without and then with the CAD system [5], [26]. Details about each reader, such as level of expertise, are recorded in the Nodule_Reader table (see Fig. 2).

In the experiment, an observer reads a series without the CAD system and marks nodules, which are stored in the Nodule_Marking table. The automated nodule detection system produces a segmentation that is stored in the Segmentation_Result table. For convenience, ROIs corresponding to segmented (system-detected) nodules are stored in the Nodule_System table. The system-detected nodules are displayed for the observer, who can mark any previously missed nodules. These additional markings are also written to the Nodule_Marking table. Fig. 4 shows a lung nodule with boundaries and "diameters" marked. Upon completion of data collection, the Nodule_Marking table can be queried to obtain the number of additional nodules detected when the computer assistance was available. The corresponding increase in sensitivity can be computed as a measure of the usefulness of the CAD system.
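A summary of the kind just described, counting markings made without and with CAD for each reader, could be obtained with a query such as the following. The with_cad flag and the other column names on Nodule_Marking are assumptions made for the example; computing sensitivity would additionally require the reference standard of confirmed nodules.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

/** Per-reader nodule marking counts without and with CAD assistance (assumed columns). */
public class CadSensitivitySummary {
    public static void report(Connection db) throws Exception {
        String sql = "SELECT reader_id, " +
                     "       SUM(CASE WHEN with_cad THEN 0 ELSE 1 END) AS marked_without_cad, " +
                     "       SUM(CASE WHEN with_cad THEN 1 ELSE 0 END) AS marked_with_cad " +
                     "FROM nodule_marking GROUP BY reader_id ORDER BY reader_id";
        try (Statement stmt = db.createStatement(); ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("Reader %d: %d nodules marked unaided, %d additional with CAD%n",
                        rs.getInt("reader_id"),
                        rs.getInt("marked_without_cad"),
                        rs.getInt("marked_with_cad"));
            }
        }
    }
}
```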
B. Experiment #2—Emphysema Lung Treatment Trial

The Feasibility of Retinoid Therapy for Emphysema (FORTE) trial is a preliminary study into the effectiveness of retinoic acid as a treatment for emphysema. As part of the FORTE study, subjects are imaged on three visits (studies), each separated by three to six months. Subjects are randomized into one of two groups. One group is on treatment between Visits 1 and 2 and on placebo between Visits 2 and 3; the other group is on placebo and then treatment. As part of the trial, the research team is blinded as to which group is which, and all subjects undergo a thoracic CT exam as part of their workup. The goal of the QIA is to determine whether there is a quantitative difference between the imaging data from the two groups that may indicate whether the drug is effective. Specifically, the quantitative data must be mined to identify quantitative measures that show a difference between the groups.

During QIA, the lung is segmented into various regions. Changes in lung volume and in the percentage of lung involved in emphysema (based on X-ray attenuation histograms) are measured from images acquired at each visit and used to determine whether the treatment is having an effect. Texture measures are also computed in each region and will be investigated later, as part of an additional research question, to determine whether they may offer better characterization of emphysematous lung than the attenuation histograms. These quantitative measures are stored in the Lung_Histogram, Lung_Texture, and Lung_Volume tables. Fig. 5 shows a cross-sectional lung CT image with emphysematous regions highlighted in gray within the lung fields. These regions are determined automatically based on their X-ray attenuation: lungs destroyed by emphysema have lower attenuation than the surrounding lung parenchyma. The database infrastructure is important to enable the large number of quantitative measures (see Section V-C) to be mined to look for trends across the three imaging studies for each patient.
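The percentage of lung involved in emphysema is derived from the attenuation histogram of the segmented lung. A minimal sketch of such a measure, counting the fraction of lung voxels below a fixed attenuation threshold, is given below; the threshold value shown is only an example and is not taken from the FORTE protocol.

```java
/** Percentage of lung voxels below an attenuation threshold (a "density mask" style measure). */
public class EmphysemaScore {

    /**
     * hounsfield: attenuation values of voxels inside the segmented lung ROI.
     * thresholdHU: cutoff below which a voxel is counted as emphysema-like;
     * the cutoff actually used in a given study is a protocol decision.
     */
    public static double percentBelow(short[] hounsfield, int thresholdHU) {
        if (hounsfield.length == 0) return 0.0;
        long below = 0;
        for (short hu : hounsfield) {
            if (hu < thresholdHU) below++;
        }
        return 100.0 * below / hounsfield.length;
    }

    public static void main(String[] args) {
        short[] lungVoxels = {-980, -940, -870, -820, -910, -990};
        // Example threshold only; it is not taken from the paper.
        System.out.printf("%.1f%% of lung below -950 HU%n", percentBelow(lungVoxels, -950));
    }
}
```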
Fig. 5. Computer-detected regions of emphysema on a lung CT image.
TABLE I NUMBER OF ROWS IN SELECTED TABLES IN THE DATABASE
C. Overall Database Size

The database currently stores data from nine experiments. Table I shows the number of rows in some of the tables described. The table shows that, since there is usually a large number of images in each CT series, there are a large number of rows in the Image table. Some quantitative measurements, such as histograms, are made on multiple subregions within the lung field, from multiple series of a single patient. Therefore, they also generate many rows for the numerous patients within the database.

VI. SCALABILITY

The process and data models are designed to scale to include additional experiments and CAD systems, potentially for other organ systems. A new CAD system can be developed and tested within this research architecture by adding segmentation and quantitation modules. Output from a new segmentation module is written in the form of labeled ROIs, as described in Section III-D. To accommodate new quantitative data, additional tables can be added as leaves in the data model shown in Fig. 2. Quantitative tables from two different experiments (emphysema and nodule detection) are shown in Fig. 2, and more can be added. If a new CAD system generates quantitative data that is similar to that of an existing system, then existing tables should be used as far as possible. This facilitates pooling of data to test global hypotheses and search for trends. We expect the tables for all experiments to be identical above the image level. However, there may be quite a variety of quantitative tables; for example, tables for quantitative data associated with fused images may contain multiple image series identifiers as foreign keys.

The data model and database described here could also be implemented within a grid architecture, with the database residing in the Collective layer and referencing large medical image data sets stored in repositories within the Fabric layer [27]. Implementation within the grid could enhance the quantitative imaging research environment described in this paper because the grid could provide access to large numbers of distributed image data sets. It could also provide access to high-performance computing resources for execution of image processing algorithms.

The system design also supports scaling of the hardware platform. For example, the data repository where images are physically stored can be added to and adjusted dynamically. The database stores the relative path to each image and a foreign key to a table that contains information on how a given client workstation can translate that relative path to an absolute path. This allows the data repository to be split across multiple devices, or additional devices to be added as the need arises. If a client workstation does not have sufficient processing power for a given QIA protocol, the numerical processing can be distributed across multiple workstations using the built-in support for distributed computing in Java [28].

VII. DISCUSSION
The quantitative data management capabilities have been essential in enabling experimental studies to move from tens to hundreds of subjects. Data collection across various experiments is moving rapidly, and we expect to have over 1000 subjects in the database within the next six months. Automated computer analyses have already generated hundreds of thousands of numerical data points with the current number of subjects.

An important issue in all medical research is patient confidentiality and data security. As mentioned in Section II, image data that comes to our laboratory from other sites is anonymized prior to transmittal, with patient names and identifiers replaced with study-assigned ID numbers. The database and analysis workstations are behind a network firewall. Incoming image data is transferred to a dedicated data transfer server that is accessible only to designated Internet protocol addresses at sites participating in multicenter studies. When images are received, they are moved from this machine to servers behind the firewall. Standards for security of medical data are currently being adopted based on the Standards for Privacy of Individually Identifiable Health Information (HIPAA) defined by the Department of Health and Human Services,4 so this is an evolving area for medical informatics.

New medical image acquisition techniques are generating increasing numbers of images. Computer vision systems are being developed to abstract the data and extract quantitative measurements to assist in diagnosis. RIS and PACS systems for patient records and image data have been the subject of much research and development [11], [12], but databases for quantitative medical image data have received relatively little attention. The architecture described in this paper provides such a database and goes one step further by integrating computer image analysis techniques with the relational database in which the quantitative results are stored. Such systems are essential if CAD is to reach its full potential. We have developed a flexible database design to handle the large amounts of quantitative data to be stored, organized, and mined in a research environment. This is an important step as CAD from medical images continues to evolve.

4 U.S. Department of Health and Human Services Web Page, http://www.hhs.gov/ocr/hipaa/.
REFERENCES

[1] S. Sone, S. Takashima, F. Li et al., "Mass screening for lung cancer with mobile spiral computed tomography scanner," Lancet, vol. 351, pp. 1242–1245, 1998.
[2] C. I. Henschke, D. I. McCauley, D. F. Yankelevitz et al., "Early lung cancer action project: Overall design and findings from baseline screening," Lancet, vol. 354, pp. 99–105, 1999.
[3] N. L. Muller, C. A. Staples, R. R. Miller, and R. T. Abboud, "'Density mask': An objective method to quantitate emphysema using computed tomography," Chest, vol. 94, pp. 782–787, 1998.
[4] H. O. Coxson, R. M. Rogers, K. P. Whittall, Y. D'Yachkova, P. D. Pare, F. C. Sciurba, and J. C. Hogg, "The measurement of lung expansion with computed tomography and comparison with quantitative histology," J. Appl. Physiol., vol. 79, pp. 1525–1530, 1995.
[5] M. S. Brown, J. G. Goldin, R. D. Suh, M. F. McNitt-Gray, J. W. Sayre, and D. R. Aberle, "Automatic detection of lung micronodules in high resolution CT images—Preliminary experience," Radiology, vol. 226, pp. 256–262, 2003.
[6] S. G. Armato, F. Li, M. L. Giger, H. MacMahon, S. Sone, and K. Doi, "Lung cancer: Performance of automated nodule detection applied to cancers missed in a CT screening program," Radiology, vol. 225, pp. 685–692, Dec. 2000.
[7] M. F. McNitt-Gray, E. M. Hart, N. Wyckoff, J. W. Sayre, J. Goldin, and D. R. Aberle, "A pattern classification approach to characterizing solitary pulmonary nodules imaged on high resolution CT: Preliminary results," Med. Phys., vol. 26, pp. 880–888, 1999.
[8] S. J. Swensen, R. W. Viggiano, D. E. Midthun, N. L. Muller, A. Sherrick, K. Yamashita, D. P. Naidich, E. F. Patz, T. E. Hartman, J. R. Muhm, and A. L. Weaver, "Lung nodule enhancement at CT: Multicenter study," Radiology, vol. 214, no. 1, pp. 73–80, Jan. 2000.
[9] J. T. Mao, J. G. Goldin, J. Dermand, G. Ibrahim, M. S. Brown, A. Emerick, M. F. McNitt-Gray, D. W. Gjertson, F. Estrada, D. P. Tashkin, and M. D. Roth, "A pilot study of all-trans-retinoic acid for the treatment of human emphysema," Amer. J. Respir. Crit. Care Med., vol. 165, no. 5, pp. 718–723, 2002.
[10] M. F. McNitt-Gray, S. G. Armato, L. P. Clarke, G. McLennan, C. R. Meyer, and D. F. Yankelevitz, "The lung imaging database consortium: Creating a resource for the image processing research community," Radiology, vol. 225, pp. 739–748, 2002.
[11] S. T. C. Wong and H. K. Huang, "Design methods and architectural issues of integrated medical image data base systems," Comput. Med. Imag. Graph., vol. 20, no. 4, pp. 285–299, 1996.
[12] J. Pereira, A. Castro, A. Castro, B. Arcay, and A. Pazos, "Construction of a system for the access, storage and exploitation of data and medical images generated in radiology information systems (RIS)," Med. Inform., vol. 27, no. 3, pp. 203–218, 2002.
[13] M. S. Brown, M. F. McNitt-Gray, J. G. Goldin, L. E. Greaser, and D. R. Aberle, "Knowledge-based method for segmentation and quantitative analysis of lung function from CT," in Computer-Aided Diagnosis in Medical Imaging: Proc. 1st Int. Workshop Computer-Aided Diagnosis, Chicago, IL, Sep. 20–23, 1998, pp. 113–118.
[14] M. S. Brown, M. F. McNitt-Gray, J. G. Goldin, L. E. Greaser, U. M. Hayward, J. W. Sayre, M. K. Arid, and D. R. Aberle, "Automated measurement of single and total lung volume from CT," J. Comput. Assisted Tomography, vol. 23, no. 4, pp. 632–640, 1999.
[15] M. S. Brown, J. G. Goldin, M. F. McNitt-Gray, L. E. Greaser, A. Sapra, K. T. Li, J. W. Sayre, K. Martin, and D. R. Aberle, "Knowledge-based segmentation of thoracic CT images for assessment of split lung function," Med. Phys., vol. 27, no. 3, pp. 592–598, 2000.
[16] M. S. Brown, M. F. McNitt-Gray, J. G. Goldin, and D. R. Aberle, "Extensible knowledge-based architecture for segmenting CT data," in Proc. SPIE Medical Imaging 1998: Image Processing, vol. 3338, 1998, pp. 564–574.
[17] M. S. Brown, M. F. McNitt-Gray, N. J. Mankovich, J. Hiller, L. S. Wilson, J. G. Goldin, and D. R. Aberle, "Method for segmenting chest CT image data using an anatomical model: Preliminary results," IEEE Trans. Med. Imag., vol. 16, no. 6, pp. 828–839, Dec. 1997.
[18] M. S. Brown, M. F. McNitt-Gray, J. G. Goldin, and D. R. Aberle, "Extensible knowledge-based architecture for segmenting CT data," in Proc. SPIE Medical Imaging 1998: Image Processing, vol. 3338, 1998, pp. 564–574.
[19] M. S. Brown, R. W. Gill, H. E. Talhami, L. S. Wilson, and B. D. Doust, "Model-based assessment of lung structures: Inferencing and control system," in Proc. SPIE Medical Imaging 1996: Physiology and Function From Multidimensional Images, E. A. Hoffman, Ed., 1995, Paper 2433, pp. 167–178.
[20] L. A. Zadeh, "Fuzzy sets," Inform. Contr., vol. 8, pp. 338–353, 1965.
[21] R. M. Haralick, K. Shanmugam, and I. Dinstein, "Textural features for image classification," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6, pp. 610–621, 1973.
[22] R. M. Haralick, "Statistical and structural approaches to texture," Proc. IEEE, vol. 67, no. 5, pp. 786–804, 1979.
[23] R. Uppaluri, E. A. Hoffman, M. Sonka, P. G. Hartley, G. W. Hunninghake, and G. McLennan, "Computer recognition of regional lung disease patterns," Amer. J. Respir. Crit. Care Med., vol. 160, no. 2, pp. 648–654, Aug. 1999.
[24] M. S. Brown, M. F. McNitt-Gray, N. Wyckoff, and A. Bui, "Object-oriented region of interest toolkit for workstations," in Proc. SPIE Medical Imaging 1998: Image Display, vol. 3335, 1998, pp. 627–636.
[25] R. Ihaka and R. Gentleman, "R: A language for data analysis and graphics," J. Comput. Graph. Statist., vol. 5, no. 3, pp. 299–314, 1996.
[26] M. S. Brown, J. G. Goldin, R. D. Suh, M. F. McNitt-Gray, J. W. Sayre, and D. R. Aberle, "Lung micronodules: Automated method for detection at thin-section CT—initial experience," Radiology, vol. 226, pp. 256–262, 2003.
[27] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: Enabling scalable virtual organizations," Int. J. Supercomput. Applicat., vol. 15, no. 3, pp. 200–222, 2001.
[28] R. Bao, "Distributed computing via RMI and CORBA," presented at Jini and Advanced Features of Java. [Online]. Available: http://www.cs.helsinki.fi/u/campa/teaching/bao-final.pdf
Matthew S. Brown is an Assistant Professor in the UCLA Department of Radiological Sciences. He is a Co-Director of the UCLA Thoracic Imaging Research Laboratory and leader of its Computer Vision Team. His research interests include computer analysis of medical images and the development of automated systems to augment and assist in the diagnostic process.
Sumit K. Shah received the B.A. degree in physics from the University of Dublin, Ireland, in 1999. He is currently working toward the Ph.D. degree in biomedical physics at UCLA, Los Angeles, CA.
Richard C. Pais received a degree from Bombay University, India. He is a Software Developer with the Thoracic Research Group. He has been involved in the design and modeling of relational database management systems for over ten years. Currently his interest is in the use of XML-related technologies with special emphasis on XML Schema, XSLT, XSL-FO, and XPath.
Yeng-Zhong Lee (S’03) received the M.S. degree in computer science from University of California, Los Angeles, in 2001. He is currently working toward the Ph.D. degree at UCLA, Los Angeles, CA. He joined the Wireless Adaptive Mobility Laboratory of Prof. Mario Gerla in 2000 at UCLA-CSD. His research interests include Bluetooth, routing protocols, and security for ad-hoc networks.
Michael F. McNitt-Gray (S’91–M’93) received the B.S. degree in electrical engineering from Washington University, St. Louis, MO, in 1979, the M.S. degree in electrical engineering from Carnegie-Mellon University, Pittsburgh, PA, in 1980, and the Ph.D. degree in biomedical physics from UCLA, Los Angeles, CA, in 1993. From 1994 to 2001, he was an Assistant Professor in the Medical Imaging Division, Department of Radiological Sciences, University of California, Los Angeles. He is currently an Associate Professor in Thoracic Imaging, Department of Radiological Sciences, University of California. Dr. McNitt-Gray received the James T. Case Radiological Foundation award in 1994 and the Whitaker Foundation Grant in 1996.
Jonathan G. Goldin received the M.B.Ch.B. degree in medicine and the Ph.D. degree from the University of Cape Town Medical School, in 1983 and 1990, respectively. From 1997 to 2001, he was an Assistant Professor in the Biomedical Physics Program, University of California, Los Angeles. He is currently Associate Professor of Thoracic Imaging and Associate Professor in the Biomedical Physics Program, David Geffen School of Medicine, UCLA, Los Angeles, CA.
Alfonso F. Cardenas received the B.S. degree from San Diego State University, San Diego, CA, and the M.S. and Ph.D. degrees in computer science from the University of California, Los Angeles, in 1969. He is a Professor in the Computer Science Department, Henry Samueli School of Engineering and Applied Science, UCLA, Los Angeles, CA. His major areas of interest include database management, distributed heterogeneous and multimedia (text, image/picture, voice) systems, information systems planning and development methodologies, software engineering, medical informatics, and legal and intellectual property issues.
Denise R. Aberle received the B.A. degree in French literature (with honors) and the M.D. degree from the University of Kansas, in 1975 and 1979, respectively. From 1987 to 1988, she was an Assistant Professor, and from 1992 to 1997, she was an Associate Professor in the Department of Radiological Sciences, University of California, Los Angeles. She is currently Chief of Thoracic Imaging, University of California, and Professor and Vice-Chair of Research, Department of Radiology, University of California.