Evaluating a row-store data model for full-content DICOM management
Alexandre Savaris¹,²,³, Theo Härder¹, Aldo von Wangenheim²,³
¹ University of Kaiserslautern – Dept. of Computer Science – AG DBIS – Kaiserslautern – Germany
² Federal University of Paraná (UFPR) – Dept. of Informatics – Curitiba – PR – Brazil
³ National Institute for Digital Convergence (INCoD) – Florianópolis – SC – Brazil
[email protected], [email protected], [email protected]
Abstract—The management of data acquired on a daily basis, together with historical data kept available over variable time frames, has become a practical concern in healthcare practice. Data organized according to the DICOM (Digital Imaging and Communications in Medicine) standard contribute to this scenario, demanding solutions for the storage and retrieval of metadata and images that lie beyond the customary file system approach. As a way of experimentation, this work defines and implements a data model using a NoSQL, row-store database, aiming to evaluate its suitability for managing DICOM content at tag level, as well as its performance compared to a common approach based on a relational schema. The obtained results attest to the capacity of the proposed model to store full-content DICOM images, using tags as data units. In terms of performance, however, the data model is outperformed by the relational approach.
Keywords—DICOM; Databases; RDBMS; NoSQL
I. INTRODUCTION
Healthcare institutions generate, on a daily basis, huge volumes of structured, semi-structured, and even unstructured operational data. The maintenance of such content is paramount, being regulated by local policies established by each institution, as well as by federal and state laws. Furthermore, the availability of historical data contributes to enhancing the medical practice itself. Examples can be seen in identifying subtle relationships over historical Electronic Medical Record (EMR) content (as a way of prediction supported by past-identified events), in developing and manufacturing new medical devices, and in defining better workflows and visualization tools [1]-[5]. Despite the fact that a relevant portion of data in the healthcare domain is organized according to the DICOM (Digital Imaging and Communications in Medicine) standard, few works have explored the storage and/or analysis of this type of content using NoSQL technologies [6], [7]. Key-value databases, column- and row-oriented databases, and document databases arise as alternatives, improving on raw file-system-based storage and loosening the constraints of the relational model, in addition to addressing the ever-growing need for scalability. Open issues on this topic include the modeling of DICOM content according to the features of these database architectures and the study of data partitioning strategies aiming at better distribution results. (This work was supported by CNPq – National Council for Scientific and Technological Development – Brazil.) Both issues are
relevant when effectiveness and performance in search operations and data retrieval are demanded by high-level applications. In this work, a number of experiments are performed, aiming to provide initial answers to the following questions:
1. Is it possible to manage full-content DICOM images at tag level, using a data model built over a row-store, NoSQL database?
2. Despite its close relationship with big volumes of data, does a row-store, NoSQL database perform well on small datasets when compared to known approaches, i.e., relational databases?
To guide the answers to these questions, the current work is organized as follows. Section II summarizes the DICOM image organization, focusing on file and data element characteristics, and briefly describes the query and retrieval guidelines established by the standard. Section III proposes a row-store data model, designed according to the concepts presented in section II. Experiments involving the proposed data model are described in section IV, the acquired results (compared to results of a relational model) are discussed in section V, and the overall conclusions are presented in section VI.
II. OVERVIEW OF DICOM ORGANIZATION
A. Structure of DICOM content
The DICOM standard, originally developed by the American College of Radiology (ACR) and the National Electrical Manufacturers Association (NEMA), comprises a set of non-proprietary specifications regarding structure, format, and exchange protocols for digital medical images. First released in 1985 as ACR/NEMA 300, the standard evolves according to deliberations of the DICOM Standards Committee, a number of workgroups formed by manufacturers of medical devices, academia, and scientific societies [8], [9]. Data in DICOM images are organized through Information Object Definitions (IODs), which semantically group together series of fine-grained data elements. Each IOD is associated with a series of DICOM Message Service Elements (DIMSEs) through Service-Object Pairs (SOPs), defining which operations (e.g., storage, query, retrieval) can be performed. The data exchange and service execution are
negotiated over a point-to-point communication, involving two Application Entities (AEs) that each assume one of two possible profiles during the interaction: Service Class User (SCU) or Service Class Provider (SCP) [10]. At the lowest organizational level, data elements are identified by tags, i.e., group/element ordered pairs. Group numbers correlate elements based on the loose similarity of their content, and element numbers identify individual attributes within each group. DICOM tags are characterized by Value Representations (VRs) and Value Multiplicities (VMs), which specify content data types, formatting rules, and the number of occurrences allowed for the content of each tag [11]. The standard defines a data dictionary composed of a set of tags with reserved group/element values, allowing its expansion through the introduction of proprietary tags [12]. At the file level, a DICOM image is structured as a set of DICOM tags. The number of tags in a file varies according to the availability of information during examination scheduling and execution, as well as the examination modality (e.g., Computed Tomography, Magnetic Resonance) and the medical device manufacturer. The real content of a DICOM file is, therefore, only known for certain at evaluation time.
B. Query/retrieval directives
The DICOM standard specifies how query and retrieval operations must be performed to guarantee interoperability among different implementations. A minimum level of conformance is expected through a baseline behavior, which can be enhanced by an extended behavior with additional features. Basically, query and/or retrieval in DICOM use the Entity-Relationship Model Definition, built over the standard-specific patient/study/series/image hierarchy. For each level of the hierarchy, a number of attributes can be used as unique, required, or optional search keys, either in isolation or combined.
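As a minimal illustration of the tag-level organization described above (a simplified sketch, not DCMTK's representation; the example tags are standard data-dictionary entries):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataElement:
    """A DICOM data element: a (group, element) tag, a VR, and a value."""
    group: int      # e.g., 0x0010
    element: int    # e.g., 0x0010
    vr: str         # Value Representation, e.g., "PN" (Person Name)
    value: str

    @property
    def tag(self) -> str:
        # Conventional (gggg,eeee) hexadecimal notation
        return f"({self.group:04X},{self.element:04X})"

# Standard data-dictionary tags; proprietary tags use other group/element values.
name = DataElement(0x0010, 0x0010, "PN", "DOE^JOHN")
modality = DataElement(0x0008, 0x0060, "CS", "CT")
print(name.tag)      # (0010,0010)
print(modality.tag)  # (0008,0060)
```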
Each attribute can be compared against single, list, range, sequence, and wildcard-complemented values, providing high flexibility in the construction of search predicates. Currently, the standard supports two Query/Retrieve Information Models: Patient Root and Study Root. Each information model identifies the top level of the hierarchy from which the search takes place. Four Query/Retrieve Level Values limit the query: Patient, Study, Series, and Image. Each value defines the bottom level of the hierarchy to which the search is executed. Together, Information Model and Level Value define a hierarchical range that must be traversed from top to bottom. In order to explore such an organization, the baseline behavior establishes that AEs responsible for answering queries shall be capable of processing Hierarchical Search Methods. These methods recursively traverse the hierarchical tree, performing the following steps:
1. The search starts at the level indicated by the Query/Retrieve Information Model, descending to the level indicated by the Query/Retrieve Level Value (the target level);
2. When the target level is reached, attributes of the current level are compared with the values supplied by the search request;
3. Entities whose data satisfy the search criteria are returned, one at a time, together with the upper-level unique identifier attributes.
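The matching forms mentioned earlier (single-value, wildcard, and range comparison) can be sketched roughly as follows; this is a deliberate simplification of the standard's matching rules, not a conformant implementation:

```python
from fnmatch import fnmatch

def matches(stored: str, key: str) -> bool:
    """Very simplified DICOM attribute matching:
    - '*' and '?' wildcards (wildcard matching)
    - 'v1-v2' on 8-digit dates (range matching)
    - otherwise, single-value (exact) matching
    """
    if "*" in key or "?" in key:
        return fnmatch(stored, key)
    if "-" in key and len(key) == 17:  # e.g., '20120101-20121231'
        lo, hi = key.split("-")
        return lo <= stored <= hi      # lexicographic works for YYYYMMDD
    return stored == key

print(matches("DOE^JOHN", "DOE^*"))           # True
print(matches("20120615", "20120101-20121231"))  # True
```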
The extended behavior, in turn, defines Relational-Queries. A Relational-Query allows a multi-level comparison of attributes and retrieval of entities, according to the following steps:
1. The search starts at the level indicated by the Query/Retrieve Information Model, descending to the level indicated by the Query/Retrieve Level Value;
2. For each level, attributes are compared with the values supplied by the search request. Data from entities that satisfy the search criteria are carried to the next level in the hierarchy;
3. When the target level is reached, data collected from all levels are returned (one entity at a time) [13].
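The two behaviors can be contrasted over a toy in-memory patient/study/series hierarchy; the function names and data layout below are hypothetical, and the code sketches the idea rather than the standard's normative algorithm:

```python
# Toy hierarchy: patient -> studies -> series; each node carries attributes.
DB = {
    "P1": {"name": "DOE^JOHN",
           "studies": {"S1": {"date": "20120101",
                              "series": {"SE1": {"modality": "CT"},
                                         "SE2": {"modality": "MR"}}}}},
}

def hierarchical_search(patient_id, study_id, target_attrs):
    """Baseline behavior: descend by the unique keys supplied for the upper
    levels; compare attributes only at the target (here: series) level."""
    series = DB[patient_id]["studies"][study_id]["series"]
    return [(sid, patient_id, study_id)            # entity + upper-level UIDs
            for sid, attrs in series.items()
            if all(attrs.get(k) == v for k, v in target_attrs.items())]

def relational_query(patient_attrs, target_attrs):
    """Extended behavior: attributes may be compared at every level;
    matching entities carry their data down to the target level."""
    hits = []
    for pid, p in DB.items():
        if not all(p.get(k) == v for k, v in patient_attrs.items()):
            continue
        for stid, study in p["studies"].items():
            for sid, attrs in study["series"].items():
                if all(attrs.get(k) == v for k, v in target_attrs.items()):
                    hits.append((pid, stid, sid))
    return hits

print(hierarchical_search("P1", "S1", {"modality": "CT"}))
print(relational_query({"name": "DOE^JOHN"}, {"modality": "MR"}))
```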
Despite the algorithmic description provided by the standard, each implementation is responsible for the particular decisions and underlying details that lead to a solution. In Relational Database Management Systems (RDBMSs), for instance, both Hierarchical Search Methods and Relational-Queries are translated into one or more SQL (Structured Query Language) queries, whose complexity derives from the organization of the database schema.
III. THE ROW-STORE DATA MODEL
Physically, the content of a DICOM file can be seen as structured at the data element level and as semi-structured at the file level. The triple of group/element tag, value representation, and value multiplicity relates to individual data elements, which can be nested within other data elements, composing semantic relationships. An image file, in turn, stores variable sets of data elements whose disposition and relationships are known only at evaluation time. Storage strategies for DICOM content must consider both levels, generating models flexible enough to incorporate new tags whenever necessary, while still supplying minimal features for the baseline query/retrieval behavior. The row-store data model implemented in this work defines, at a high level, a well-established schema modeled according to the characteristics of DICOM file content, allowing the storage of any combination of data elements. Being generalist in terms of content, the model can be used as a back-end for applications that deal with different examination modalities and equipment from a number of manufacturers, including their variable requirements for proprietary tags. The schema's structure can be seen in Fig. 1.
Fig. 1. Row-store data model for full-content DICOM storage. Value columns group different value representations according to their data types. In distributed environments, data are split and replicated among nodes based on the partition key (the patientid column). Clustered key columns are kept ordered by the underlying storage engine, allowing their use in search predicates, as are columns marked as secondary indexes.
The unique identification of rows is based on a row key. Theoretically, elements of a DICOM file could be identified by combining their hierarchical identification attributes with group/element values; however, it is common to find multiplicities of such attribute combinations within single files. To address this situation, three columns were added to the schema (as row key components), aiming to deal with the following situations:
• Group/element pairs (tags) can be repeated in single files through element sequences. Sequences are groups delimited by specific tags and can occur multiple times per file. Nested within the sequences, elements with the same tags can also occur multiple times. For both cases, the tagorder column contributes to distinguishing one instance of a tag from another. The column value is a sequential number, in the context of each file individually, and is generated at parsing time;
• Some tags have value multiplicities greater than one. The values of these tags can be split, generating multiple occurrences of the same tag for the same tagorder column value. The item column is used as an individual value-part identifier, differentiating one value from another in the context of the same tag. Like the tagorder column, the value of the item column is a sequential number;
• In some circumstances, a value part can split its content into two or more complementary subparts. This is a minor occurrence, mainly related to compressed image data, which is addressed through the use of the subitem column.
By definition, the content length of elements in a DICOM file is an even number, even in cases where the real length of an element is odd. Elements with odd content lengths are padded with space or null values, according to their VRs. Keeping the original DICOM length is relevant, considering that elements will be queried or sent by/to other AEs, and these operations expect results structured according to the standard rules. To deal with such requirements, the length column in the data model registers the original DICOM length of every entry.
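A minimal sketch of how the disambiguating columns could be assigned at parsing time (the structure is hypothetical and simplified, not the exact Cassandra schema of Fig. 1; the example tags are standard dictionary entries):

```python
def assign_row_keys(patient_id, elements):
    """elements: [(group, element, values)] in file order, where `values`
    holds the tag content already split by value multiplicity.
    Yields one row key per stored value part:
    (partition key, group, element, tagorder, item, subitem)."""
    tagorder = 0
    for group, element, values in elements:
        tagorder += 1                          # per-file sequential number:
        for item, value in enumerate(values):  # distinguishes repeated tags
            # subitem kept at 0 here; it would be > 0 only for value
            # parts split into complementary subparts (e.g., pixel data)
            yield (patient_id, group, element, tagorder, item, 0), value

elements = [
    (0x0008, 0x0060, ["CT"]),
    (0x0018, 0x1210, ["CONV", "BONE"]),  # VM > 1: two value parts
    (0x0008, 0x0060, ["CT"]),            # repeated tag (e.g., in a sequence)
]
rows = list(assign_row_keys("P1", elements))
for key, value in rows:
    print(key, value)
```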
IV. EXPERIMENTS
A. Resources
Once defined, the row-store data model was implemented using Apache Cassandra v1.2.4, an open-source partitioned row-store database system maintained by the Apache Software Foundation, based on the architecture of Amazon Dynamo and on the data model of Google Bigtable [14]. Two setups were provided, according to the hardware and software profiles presented in Table I. To allow performance comparisons, an instance of a relational database schema was implemented using MariaDB v10.0.3, according to the data model proposed in [15]. In both setups, there was no fine-tuning of configuration parameters to improve performance. Due to its distributed nature, the cluster setup was configured with a replication factor of three (meaning that each write to the database was replicated on three different nodes). Reads performed on the cluster setup were configured to use a quorum of two (meaning that the returned data are consistent on at least two nodes). The loading of DICOM content into the databases and the query/retrieval experiments were performed using a software application developed specifically for this purpose. Built in C++, the software uses the DICOM Toolkit (DCMTK) libraries for parsing and extraction of DICOM metadata, and the Apache Thrift framework for communicating with the underlying row storage engine [16], [17]. To evaluate the behavior of the row-store data model in scenarios of heterogeneous content, a number of freely available DICOM sample image sets were used as input [18]. The sample image sets (described in Table II) cover different examination modalities, characterized by a variable number of files per study/series and elements per file.
B. Limitations
The query/retrieval experiments performed in this work are limited to single-value comparison predicates, involving the unique identifiers defined by the DICOM standard for the patient/study/series/image hierarchical levels.
Each query returns data from tags related to a single hierarchical level, which classifies it as a hierarchical search method.
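In Cassandra, a quorum read touches a majority of the replicas of a row; with the replication factor of three used in the cluster setup, reading two nodes is exactly that majority. A quick check (the quorum formula is Cassandra's; the rest is plain arithmetic):

```python
def quorum(replication_factor: int) -> int:
    # Cassandra computes QUORUM as floor(RF / 2) + 1
    return replication_factor // 2 + 1

RF = 3                      # replication factor of the cluster setup
R = quorum(RF)              # read quorum: 2 nodes must agree
print(R)                    # 2
# If writes also reach at least a quorum of replicas, the read and
# write replica sets always overlap, so a quorum read sees the
# latest successfully written value:
assert R + quorum(RF) > RF
```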
TABLE I. EXPERIMENTAL SETUPS

Setup                 Processor                           Memory       Storage       Operating System
Stand-alone           Intel® Core™ i7, 2.7 GHz            4 GB DDR3    500 GB SATA   OS X 10.8.3
Cluster (Nodes 1-5)   Intel® Xeon® X3440, 2.53 GHz (x8,   4 GB DDR3    859 GB SATA   Ubuntu 10.04.1
                      shared through virtualization)      (per node)   (per node)
TABLE II. IMAGE DATASETS USED IN EXPERIMENTS

Examination modality                  Tags per file   Avg. size per file (bytes)    Size on
                                      (average)       Metadata tags   Image tags    disk (MB)
Computed Radiography (CR)             80              802             2278594       14
X-Ray Angiography (XA)                120             5662            1442097       83
Secondary Capture (SC)                64              932             168897        151
Positron Emission Tomography (PET)    161             3085            16211         111
Magnetic Resonance (MR)               159             2704            72006         363
Computed Tomography (CT)              132             3888            109054        3272

V. RESULTS AND DISCUSSION
A. Full-content Storage
The results for the storage operation consider the whole image sets described in Table II and include the parsing step needed to extract individual elements from the DICOM files. Fig. 2 depicts the obtained results for the evaluated approaches, with the examination modalities ordered according to the time needed for storage.
Fig. 2. Time needed for storing the sample image sets. The storage times were acquired in scenarios involving stand-alone (SA) and cluster (CL) setups, with single (SW) and multiple (MW, five concurrently executed) writing processes. Due to the wide range of its values, the vertical axis is expressed in logarithmic scale.
The amount of time required to store DICOM content derives not only from the dataset size, but also from the file content complexity. A clear example can be seen when comparing the SC and PET examination modalities. Although the PET dataset is smaller than the SC dataset, it demands more time to be stored, because its number of data elements per file is almost three times larger. As the number of specific image-related elements is the same in both datasets, the overhead is directly related to the metadata – a drawback when full-content storage is performed.
In a global evaluation, the image sets behave quite similarly in terms of storage time across examination modalities. The stand-alone row-store setup, with single and multiple writing processes, outperforms the cluster setup, requiring less time to store the same amount of data. Considering the accumulated storage time, SA is 89.8% faster than CL. This result is a direct consequence of the absence of replication and network communication, which are characteristics of the cluster setup. Despite its better performance, the stand-alone setup does not aim to be scalable and/or fault tolerant, which limits its use in scenarios of growing data volumes that require high availability. The cluster setup, in turn, exploits the features provided by the underlying storage engine to eliminate single points of failure (SPOFs), as well as to expand storage capacity through the on-demand addition of new nodes. The use of multiple processes performing parallel writes is a viable alternative to speed up storage. For both row-store setups, the use of multiple writing processes outperforms its single-writer counterpart, reducing the storage time by about 77.9%. Particularly in the cluster setup, it is possible to improve write performance using the peer-to-peer model of the underlying storage engine. In this model, writes can be routed to different nodes in parallel, avoiding bottlenecks related to single-node access. However, the use of multiple writing processes is not enough to entirely surpass the relational approach: it is faster for CR, XA, and SC, but performs poorly when storing data from PET, MR, and CT.
B. Query Metadata
The measurement of query times is performed by choosing a set of data elements related to each hierarchical level, according to the specifications of the DICOM standard. Execution times averaged over 10 repetitions were considered, generated using randomly selected values for the hierarchical attributes chosen as search predicates. Fig.
3 shows the obtained results, grouped by approach and examination modality, allowing the comparison of values for the four hierarchical levels and their corresponding tags. A common behavior can be observed across examination modalities, where larger sets of data elements require more time to be queried/retrieved. One exception appears in the row-store setups for CT, in which the four series tags take longer than the 25 image tags. This exception derives from the optional tags allowed by the DICOM standard: in practice, the number of existing tags at the image level in the evaluation datasets is less than 25, which considerably decreases the search and fetching times required to bring the results to the client application. The time decrease is reinforced by a characteristic of the underlying storage architecture, which uses individual column values as fetching units. Instead of fetching entire rows for posterior filtering of the relevant columns (the common behavior of horizontal architectures), the row store filters the relevant columns at the source. Differences in selectivity among hierarchical levels, in addition to the number of tags per level, have a direct impact on the obtained results. For each level, search
Fig. 3. Query times for metadata elements related to DICOM hierarchical levels. Each element is queried separately. The aggregated time for all elements of a specific level is averaged after 10 executions, and the result is plotted in the chart using a logarithmic scale (due to the wide range of its values).
predicates are built using the current-level unique identifier attribute plus the unique identifier attributes from the higher levels. The impact of such selectivity criteria can be seen in the image-level results: despite the fact that this level has almost the same number of tags as the study level, the mean time required to execute search operations is considerably smaller. The relational approach presents a well-defined pattern across examination modalities, related to the number of tags per level. In a direct comparison, this strategy performs worse than both row-store setups for some combinations of examination modality/hierarchical level; however, in a global evaluation, it is 8.9% and 19.2% faster, respectively, than the SA and CL setups of the row store.
C. Full-content Retrieval
Experiments involving full-content retrieval adopt the same strategy used in the queries, performing 10 repetitions with random values for the unique identifiers of the study, series, and image levels. The obtained results are depicted in Fig. 4. As expected, time decreases as selectivity increases. This behavior is observed in both row-store setups, for all examination modalities, with better results achieved in the cluster – despite the network delay inherent to the communication between the client and the cluster nodes. The partitioning strategy adopted by the proposed data model, based on the patient unique identifier, guarantees that all data from a single patient are routed to the same node. This prevents query forwarding among nodes, and allows a query routed to a single node to execute on smaller datasets, contributing to the performance gain. Comparisons made among examination modalities at the image level show that bigger images imply longer search times. This behavior cannot be directly extended to the study and series levels, due to the variability in the number of images that compose a series and the number of series that compose a study.
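The effect of partitioning on the patientid column can be sketched as follows (the hash function and node count are illustrative; Cassandra actually distributes partitions over a token ring): every row of a patient carries the same partition key, so a patient-rooted query touches a single node.

```python
import hashlib

NODES = 5  # cluster size used in the experiments

def node_for(patient_id: str) -> int:
    """Route a row by its partition key: a stable hash of patientid."""
    digest = hashlib.md5(patient_id.encode()).hexdigest()
    return int(digest, 16) % NODES

# Every tag row of one patient hashes to the same node, so the whole
# patient lands on a single node, regardless of how many tags it has:
rows = [("P1", tagorder) for tagorder in range(1000)]  # (patientid, tagorder)
nodes_touched = {node_for(pid) for pid, _ in rows}
print(len(nodes_touched))  # 1
```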
Once more, the relational approach takes the advantage over the row-store configurations. Although slower for some examination modalities and hierarchical levels (e.g., XA at study/series), it is still 81.7% and 83.2% faster than, respectively, the SA and CL setups of the row store (in average global evaluation times).
VI. CONCLUSION
This work evaluates a row-store data model, designed to manage DICOM datasets at tag level, comparing its performance metrics to a more common, relational approach. Experiments include full-content storage and retrieval, and query operations executed on metadata tags. In terms of consistency, the row-store data model is as effective as the relational model, storing the whole datasets without restrictions in terms of tag types, VRs, and VMs. It is also flexible enough to support new tag groups/elements and values, without modifications to the underlying storage schema. Both consistency and flexibility allow a positive answer to the first question stated in section I, establishing NoSQL databases as suitable alternatives for DICOM storage at tag level. Considering overall performance, the relational approach is faster than the row store. Despite some exceptions identifiable in Figs. 2, 3, and 4, the row store is outperformed in most of the experiments. Based on these results, a simplistic answer to the second question in section I is that NoSQL approaches are not suitable for small DICOM datasets; however, as stated earlier, it is possible to improve performance on such architectures through refinements in their data model as well as in their partitioning/distribution policies. Nevertheless, for the experimental setup used in this work, the chosen NoSQL approach/data model is an unfeasible option. As future directions, it is suggested that the second question be further answered through the execution of new experiments, combining different NoSQL databases (and
Fig. 4. Average time measured for retrieval operations. The patient level is not considered, because retrieving all images from one patient at once is an uncommon practice. The vertical axis is expressed in logarithmic scale (due to the wide range of its values).
their respective data model guidelines) with variable data partitioning strategies, guided by the DICOM standard precept regarding query/retrieval.
REFERENCES
[1] N. H. Shah and J. D. Tenenbaum, "The coming age of data-driven medicine: translational bioinformatics' next frontier", J Am Med Inform Assoc, vol. 19(e1), pp. e2-e4, Jun 2012.
[2] F. Wang, N. Lee, J. Hu, J. Sun, S. Ebadollahi, and A. F. Laine, "A Framework for Mining Signatures from Event Sequences and Its Applications in Healthcare Data", IEEE Trans Pattern Anal Mach Intell, vol. 35, no. 2, pp. 272-285, Feb 2013.
[3] A. G. Erdman, D. F. Keefe, and R. Schiestl, "Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device Innovation", IEEE Trans Biomed Eng, vol. 60, no. 3, pp. 700-706, Mar 2013.
[4] S. Wandelt, A. Rheinländer, M. Bux, L. Thalheim, B. Haldemann, and U. Leser, "Data Management Challenges in Next Generation Sequencing", Datenbank-Spektrum, vol. 12, no. 3, pp. 161-171, Nov 2012.
[5] J. Boyle, R. Kreisberg, R. Bressler, and S. Killcoyne, "Methods for visual mining of genomic and proteomic data atlases", BMC Bioinformatics, vol. 13:58, Apr 2012.
[6] S. J. Rascovsky, J. A. Delgado, A. Sanz, V. D. Calvo, and G. Castrillón, "Use of CouchDB for Document-based Storage of DICOM Objects", Radiographics, vol. 32, no. 3, pp. 913-927, May-Jun 2012.
[7] L. Liu and Q. Huang, "CloudDICOM: A Large-Scale Online Storage and Sharing System for DICOM Images", Advanced Materials Research, vol. 756-759, pp. 2037-2041, 2013.
[8] W. D. Bidgood Jr., S. C. Horii, F. W. Prior, and D. E. Van Syckle, "Understanding and Using DICOM, the Data Interchange Standard for Biomedical Imaging", J Am Med Inform Assoc, vol. 4, no. 3, pp. 199-212, May-Jun 1997.
[9] P. Mildenberger, M. Eichelberg, and E. Martin, "Introduction to the DICOM standard", Eur Radiol, vol. 12, pp. 920-927, 2002.
[10] O. S. Pianykh, Digital Imaging and Communications in Medicine (DICOM): A Practical Introduction and Survival Guide. Leipzig, Germany: Springer Berlin Heidelberg, 2008, pp. 112-217.
[11] Digital Imaging and Communications in Medicine (DICOM): Part 5 – Data Structures and Encoding, National Electrical Manufacturers Association PS 3.5-2011.
[12] Digital Imaging and Communications in Medicine (DICOM): Part 6 – Data Dictionary, National Electrical Manufacturers Association PS 3.6-2011.
[13] Digital Imaging and Communications in Medicine (DICOM): Part 4 – Service Class Specifications, National Electrical Manufacturers Association PS 3.4-2011.
[14] The Apache Software Foundation. (2009). The Cassandra Project Website [Online]. Available: http://cassandra.apache.org/
[15] A. Savaris, T. Härder, and A. v. Wangenheim, "DCMDSM: a DICOM Decomposed Storage Model", J Am Med Inform Assoc, doi: 10.1136/amiajnl-2013-002337.
[16] OFFIS. (2013). DCMTK – DICOM Toolkit Website [Online]. Available: http://dicom.offis.de/dcmtk.php.en
[17] The Apache Software Foundation. (2012). Apache Thrift Website [Online]. Available: http://thrift.apache.org/
[18] OsiriX. OsiriX Imaging Software Website [Online]. Available: http://www.osirix-viewer.com/